Abstract: Residual variance and the signal-to-noise ratio are important quantities in 
many statistical models and model fitting procedures. They play an important role in 
regression diagnostics, in determining the performance limits in estimation and prediction 
problems, and in shrinkage parameter selection in many popular regularized regression 
methods for high-dimensional data analysis. We propose new estimators for the residual 
variance, the $\ell^2$-signal strength, and the signal-to-noise ratio that are consistent and 
asymptotically normal in high-dimensional linear models with Gaussian predictors and 
errors, where the number of predictors d is proportional to the number of observations 
n. Existing results on residual variance estimation in high-dimensional linear models 
depend on sparsity in the underlying signal. Our results require no sparsity assumptions 
and imply that the residual variance may be consistently estimated even when d > n 
and the underlying signal itself is non-estimable. Basic numerical work suggests that 
some of the distributional assumptions made for our theoretical results may be relaxed. 
1. Introduction 

Consider the linear model 

$$y_i = x_i^T\beta + \epsilon_i, \quad i = 1, \ldots, n, \tag{1}$$

where $y_1, \ldots, y_n \in \mathbb{R}$ and $x_1 = (x_{11}, \ldots, x_{1d})^T, \ldots, x_n = (x_{n1}, \ldots, x_{nd})^T \in \mathbb{R}^d$ are observed outcomes and $d$-dimensional predictors, respectively, $\epsilon_1, \ldots, \epsilon_n \in \mathbb{R}$ are unobserved iid errors with $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2 > 0$, and $\beta = (\beta_1, \ldots, \beta_d)^T \in \mathbb{R}^d$ is an unknown $d$-dimensional parameter. To simplify notation, let $y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$ denote the $n$-dimensional vector of outcomes and $X = (x_1, \ldots, x_n)^T$ denote the $n \times d$ matrix of predictors. Also let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$. Then (1) may be re-expressed as 

$$y = X\beta + \epsilon.$$

In this paper, we focus on the case where the predictors $x_i$ are random. More specifically, we assume that $x_1, \ldots, x_n$ are iid random vectors with mean 0 and $d \times d$ positive definite covariance matrix $\Sigma$ (many of the results in this paper are applicable if $E(x_i) \neq 0$ upon centering the data; however, this is not pursued further here). 

Let $\tau^2 = \beta^T\Sigma\beta = \|\Sigma^{1/2}\beta\|^2$, where $\|\cdot\|$ denotes the $\ell^2$-norm. Then $\tau^2$ is a measure of the overall signal strength. The residual variance $\sigma^2 = \mathrm{Var}(\epsilon_i) = \mathrm{Var}(y_i \mid x_i)$ and the signal strength $\tau^2$ are important quantities in many problems in statistics. For example, in estimation and prediction problems, $\sigma^2$ typically determines the scale of an estimator's risk under quadratic loss. More broadly, $\sigma^2$, $\tau^2$, and associated quantities, such as the signal-to-noise ratio $\tau^2/\sigma^2$, all play a key role in regression diagnostics. Thus, reliable estimators of $\sigma^2$ and $\tau^2$ are desirable. 

For invertible $X^TX$, let $\hat\beta_{ols} = (X^TX)^{-1}X^Ty$ be the ordinary least squares estimator for $\beta$. If $n - d \to \infty$, then 

$$\hat\sigma_0^2 = \frac{1}{n-d}\|y - X\hat\beta_{ols}\|^2 \tag{2}$$

is a consistent estimator for $\sigma^2$ and, under fairly mild additional conditions, is asymptotically normal. Consistent estimators for $\tau^2$ can also be constructed. For instance, if $n - d \to \infty$, it is easily seen that 

$$\hat\tau_0^2 = \frac{1}{n}\|y\|^2 - \hat\sigma_0^2 = -\frac{d}{n(n-d)}\|y\|^2 + \frac{1}{n-d}\,y^TX(X^TX)^{-1}X^Ty \tag{3}$$

is a consistent estimator for $\tau^2$ under mild conditions. 
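
As a rough illustration of (2)-(3), the classical low-dimensional estimators can be computed directly from a least squares fit. The sketch below is not taken from the paper; the function name `ols_variance_estimators` is hypothetical, and it assumes $d < n$ with $X$ of full column rank.

```python
import numpy as np

def ols_variance_estimators(y, X):
    """Return (sigma0_sq, tau0_sq), the OLS-based estimators in (2)-(3).

    Assumes d < n and that X has full column rank (hypothetical helper).
    """
    n, d = X.shape
    if d >= n:
        raise ValueError("the OLS-based estimators require d < n")
    # Least squares fit; lstsq avoids forming (X^T X)^{-1} explicitly.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta_ols
    sigma0_sq = float(residual @ residual) / (n - d)   # equation (2)
    tau0_sq = float(y @ y) / n - sigma0_sq             # equation (3)
    return sigma0_sq, tau0_sq
```

When $d \geq n$ the helper raises an error, which mirrors the breakdown of $\hat\sigma_0^2$ in the high-dimensional setting.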

It is more challenging to construct reliable estimators for $\sigma^2$ and $\tau^2$ in high-dimensional linear models, where $d > n$. Indeed, if $d > n$, then the estimator $\hat\sigma_0^2$ breaks down; however, estimating $\sigma^2$ and $\tau^2$ remains important. In high-dimensional linear models with $d > n$, $\sigma^2$ plays an important role in selecting effective shrinkage parameters for many popular regularized regression methods (Bickel et al., 2009; Candes and Tao, 2007; Zhang, 2010). The signal-to-noise ratio $\tau^2/\sigma^2$ is also important for shrinkage parameter selection, and it determines performance limits in certain high-dimensional regression problems (Dicker, 2012a,b). 

In this paper, we propose new estimators for $\sigma^2$ and $\tau^2$ that are consistent and asymptotically normal, with rate $n^{-1/2}$, in an asymptotic regime where $d/n \to \rho \in [0, \infty)$ (whenever we write $d/n \to \rho$, it is implicit that $n \to \infty$ as well). We also show that these estimators may be used to derive consistent and asymptotically normal estimators for functions of $\sigma^2$ and $\tau^2$, like the signal-to-noise ratio. Previous work on estimating $\sigma^2$ in high-dimensional linear models where $d > n$ has been conducted by Sun and Zhang (2011) and Fan et al. (2012). These authors assume that $\beta$ is sparse (e.g. the $\ell^1$-norm or $\ell^0$-norm of $\beta$ is small) and their results for estimating $\sigma^2$ are related to the fact that $\beta$ itself is estimable under the specified sparsity assumptions. Though Sun and Zhang's (2011) and Fan et al.'s (2012) results even apply in settings where $d/n \to \infty$, their sparsity assumptions may be untenable in certain instances and this can dramatically affect the performance of their estimators. In this paper, we make no sparsity assumptions (however, $\sigma^2$ and $\tau^2$ are required to be bounded) and we show that the proposed estimators for $\sigma^2$ and $\tau^2$ perform well in situations where $d > n$ and $\beta$ is provably non-estimable. This is one of the main messages of the paper: Though some type of sparsity is required to consistently estimate $\beta$ in high-dimensional linear models, sparsity in $\beta$ is not required to estimate $\sigma^2$ and $\tau^2$. 

1.1. Distributional assumptions 

Though sparsity is not required in this paper, we do make strong distributional assumptions about the data. In particular, we henceforth assume that 

$$\epsilon_1, \ldots, \epsilon_n \sim N(0, \sigma^2) \quad\text{and}\quad x_1, \ldots, x_n \sim N(0, \Sigma). \tag{4}$$

While normality is used heavily throughout our analysis, we expect that key aspects of many of the results in this paper remain valid under weaker distributional assumptions. This is explored via simulation in Section 4. 
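
To make the Gaussian design assumption concrete, the following sketch draws one data set from the linear model with normal errors and normal predictors. It is only an illustration: the function name `simulate_linear_model`, its arguments, and the use of a Cholesky factor to impose the predictor covariance are assumptions made here, not part of the paper.

```python
import numpy as np

def simulate_linear_model(n, beta, sigma2, Sigma=None, seed=0):
    """Draw (y, X) with rows of X ~ N(0, Sigma) and errors ~ N(0, sigma2).

    Sigma=None means Sigma = I. Hypothetical helper for illustration only.
    """
    beta = np.asarray(beta, dtype=float)
    d = beta.shape[0]
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))
    if Sigma is None:
        X = Z
    else:
        # If Sigma = C C^T, then rows of Z C^T have covariance Sigma.
        C = np.linalg.cholesky(np.asarray(Sigma, dtype=float))
        X = Z @ C.T
    eps = np.sqrt(sigma2) * rng.standard_normal(n)
    y = X @ beta + eps
    return y, X
```

Replacing the two `standard_normal` draws with other mean-zero distributions would give a simple way to probe departures from normality by simulation.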

Not surprisingly, the analysis in this paper is simplified by the normality assumption (4). To explain the relevance of (4) in more detail, we first point out that our primary consistency results for the proposed estimators of $\sigma^2$ and $\tau^2$ (Theorem 1 below) follow from exact calculations of the estimators' mean and variance. If the normality assumption (4) is violated, then these calculations are generally invalid; similar techniques may be applicable, if other conditions hold, but exact finite sample calculations are not likely to be possible and any corresponding approximation may be more involved. 

The normality assumption (4) also facilitates the use of a collection of "soft tools" for random matrices developed by Chatterjee (2009) to prove that the estimators proposed in this paper are asymptotically normal. These tools are related to second order Poincaré inequalities and Stein's method (Stein, 1986). Asymptotic normality for the proposed estimators follows by bounding the total variation distance to a normal random variable. These bounds contain information about how the variability of the proposed estimators may depend on $d$, $n$, $\Sigma$, $\sigma^2$, and $\tau^2$. This is easily leveraged to obtain consistent and asymptotically normal estimators for functions of $\sigma^2$ and $\tau^2$ (such as the signal-to-noise ratio, $\tau^2/\sigma^2$; see Corollary 2 below), which is an important practical objective. Thus, one of the appealing aspects of the "soft tools" used in this paper is their flexibility. On the other hand, paraphrasing Chatterjee (2009), other existing methods for asymptotic analysis in random matrix theory rely heavily on the exact calculation of limits (Bai and Silverstein, 2004; Jonsson, 1982); we suggest that this may be a more delicate endeavor in some instances. If the normality assumption (4) does not hold, then it is unclear if the soft tools used in this paper are still applicable and, consequently, other techniques may be required. Existing work in random matrix theory suggests that this may be possible (see, for example, Bai et al., 2007; El Karoui and Koesters, 2011; Pan and Zhou, 2008); however, the computations are likely more involved and the breadth of applicability of alternative techniques seems unclear. 

1.2. Correlation among predictors 

Another challenging issue for estimating $\sigma^2$ and $\tau^2$ when $d > n$ involves the covariance matrix $\mathrm{Cov}(x_i) = \Sigma$. Our initial estimators for $\sigma^2$ and $\tau^2$ are devised under the assumption that $\Sigma$ is known (equivalently, $\Sigma = I$; see Section 2). These estimators are unbiased, consistent, and asymptotically normal. We subsequently propose modified estimators for $\sigma^2$ and $\tau^2$ in cases where $\Sigma$ is unknown, but (i) a norm-consistent estimator for $\Sigma$ is available, or (ii) $\Sigma$ and $\beta$ satisfy certain conditions described in Section 3.2. If a norm-consistent estimator for $\Sigma$ is available, then the proposed estimators for $\sigma^2$ and $\tau^2$ are consistent; if, furthermore, $\Sigma$ is estimated at rate $o(n^{-1/2})$, then the estimators are asymptotically normal. On the other hand, if $d/n \to \rho \in (0, \infty)$, then norm-consistent estimators for $\Sigma$ are not generally available (though there are important examples where norm-consistent estimators for $\Sigma$ can be found; this is discussed in more detail in Section 3.1). Thus, it is important to construct estimators for $\sigma^2$ and $\tau^2$ that perform reliably when $\Sigma$ is completely unknown. While it remains an open problem to find estimators for $\sigma^2$ and $\tau^2$ that are consistent for completely general $\Sigma$, in Section 3.2 we propose estimators that are consistent and asymptotically normal, provided $\Sigma$ and $\beta$ satisfy conditions that are closely related to other conditions that have appeared in the random matrix theory literature (Bai et al., 2007; Pan and Zhou, 2008). These conditions basically require that $\beta$ and $\Sigma$ are asymptotically free in the sense of free probability (see, for example, Speicher (2003) for a brief overview of free probability and random matrix theory). 

1.3. Additional remarks 

The problems considered in this paper have at least a passing resemblance to the Neyman-Scott problem (Lancaster, 2000; Neyman and Scott, 1948). In a simplified version of this problem, observations $w_{ij} \sim N(\mu_i, \nu^2)$, $i = 1, \ldots, n$, $j = 1, 2$ are available, and the goal is to estimate $\nu^2$. The means $\mu_i$ are nuisance parameters and, without additional specification, none of the $\mu_i$ are estimable, as $n \to \infty$. Furthermore, the profile maximum likelihood estimator for $\nu^2$, which is given by 

$$\hat\nu^2_{mle} = \frac{1}{4n}\sum_{i=1}^n (w_{i1} - w_{i2})^2,$$

is inconsistent; indeed, $\lim_{n\to\infty} \hat\nu^2_{mle} = \nu^2/2$. On the other hand, the simple method of moments estimator $\hat\nu^2_{mom} = 2\hat\nu^2_{mle}$ is consistent for $\nu^2$ and asymptotically normal. 
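
The two estimators in the simplified Neyman-Scott problem are easy to compute side by side. The sketch below assumes the standard form of the profile maximum likelihood estimator for this model, $(4n)^{-1}\sum_i (w_{i1} - w_{i2})^2$, and the helper name `neyman_scott_estimators` is hypothetical.

```python
import numpy as np

def neyman_scott_estimators(w):
    """Return (nu2_mle, nu2_mom) for w of shape (n, 2).

    Row i holds the two observations that share the unknown mean mu_i.
    """
    w = np.asarray(w, dtype=float)
    n = w.shape[0]
    diffs = w[:, 0] - w[:, 1]
    # Profile MLE: plug the pair means in for mu_i, then average the
    # 2n squared residuals; this simplifies to sum(diffs^2) / (4n).
    nu2_mle = float(diffs @ diffs) / (4 * n)
    # Method of moments: E(w_i1 - w_i2)^2 = 2 nu^2, so double the MLE.
    nu2_mom = 2.0 * nu2_mle
    return nu2_mle, nu2_mom
```

Because each pair difference has variance $2\nu^2$ regardless of $\mu_i$, the doubled estimator removes the bias that the profile likelihood incurs by estimating $n$ nuisance means.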

In linear models (1) with $d > n$, which are the main focus of this paper, the parameter $\beta$ is typically non-estimable. However, we show below that $\sigma^2$ may still be consistently estimated in a variety of circumstances. Moreover, as in the Neyman-Scott problem, it is unclear how to proceed with likelihood inference. Indeed, the MLE 

$$\hat\sigma^2_{mle} = \frac{1}{n}\min_{b \in \mathbb{R}^d}\|y - Xb\|^2$$

is degenerate when $d > n$ and it can even be troublesome when $d < n$: if $d/n \to \rho \in (0, 1)$, then $\hat\sigma^2_{mle} \to (1 - \rho)\sigma^2 \neq \sigma^2$. Furthermore, similar to the Neyman-Scott problem described in the previous paragraph, the basic estimator for $\sigma^2$ derived in Section 2.1 is a method of moments estimator. 

In our view, the major implication of the preceding discussion is that the ambiguities of likelihood inference which arise in this problem contribute to difficulties in devising a systematic approach to estimation and efficiency when studying $\sigma^2$, $\tau^2$, and related quantities in high-dimensional linear models. While the estimators proposed in this paper are shown to have reasonable properties, further research into these broader issues may be warranted. 

1.4. Overview of the paper 

Section 2 is primarily devoted to the case where $\mathrm{Cov}(x_i) = I$. A motivating discussion and the definition of the basic estimators for $\sigma^2$ and $\tau^2$ may be found in Section 2.1. Section 2.2 and Section 2.3 address consistency and asymptotic normality for the basic estimators, respectively. The case where $\mathrm{Cov}(x_i) = \Sigma$ is unknown is addressed in Section 3. Section 3.1 is concerned with the case where a norm-consistent estimator for $\Sigma$ is available; Section 3.2 covers the case where no such estimator may be found, but $\beta$ and $\Sigma$ satisfy certain additional conditions. The results of three simulation studies are reported in Section 4. Two of these studies illustrate basic properties of the estimators proposed in this paper. In the third study, we compare the performance of our estimators for $\sigma^2$ to the performance of estimators for $\sigma^2$ proposed by Sun and Zhang (2011). Section 5 contains a concluding discussion, where we briefly mention some potential alternatives to the estimators proposed in this paper and issues related to efficiency. Proofs may be found in the Appendix; some of the more extended calculations required for these proofs are contained in the Supplemental Text (which may be found after the Bibliography below). 



2. Independent predictors: $\Sigma = I$ 

Throughout the discussion in this section, we assume that $\Sigma = I$. All of the calculations in Sections 2.1-2.2 require $\Sigma = I$. However, the main result of Section 2.3 (Theorem 3, on asymptotic normality) holds for arbitrary positive definite $\Sigma$. Notice that if $\Sigma \neq I$, but $\Sigma$ is known, then one easily reduces to the case where $\Sigma = I$ by replacing $X$ with $X\Sigma^{-1/2}$. 



2.1. Motivation and the basic estimators 



For illustrative purposes, suppose for the moment that $d < n$. The estimator $\hat\sigma_0^2$, defined in (2), may be interpreted in terms of the projection of $y$ onto $\mathrm{col}(X)^\perp \subseteq \mathbb{R}^n$, the orthogonal complement of the column space of $X$: it is the squared norm of this projection divided by $n - d$. This well-known interpretation highlights one of the obstacles to estimating $\sigma^2$ in linear models with more predictors than observations: If $d > n$, then $\mathrm{col}(X) = \mathbb{R}^n$; thus, $\mathrm{col}(X)^\perp = \{0\}$ and any projection onto $\mathrm{col}(X)^\perp$ is trivial. An alternative interpretation of $\hat\sigma_0^2$ suggests methods for estimating $\sigma^2$ and $\tau^2$ in high-dimensional linear models. 

Consider the linear combination of $n^{-1}\|y\|^2$ and $n^{-1}y^TX(X^TX)^{-1}X^Ty$, 

$$L_0(a_1, a_2) = a_1\frac{1}{n}\|y\|^2 + a_2\frac{1}{n}y^TX(X^TX)^{-1}X^Ty$$

for $a_1, a_2 \in \mathbb{R}$ and observe that 

$$E\left(\frac{1}{n}\|y\|^2\right) = \sigma^2 + \tau^2 \tag{5}$$

$$E\left\{\frac{1}{n}y^TX(X^TX)^{-1}X^Ty\right\} = \frac{d}{n}\sigma^2 + \tau^2 \tag{6}$$

are non-redundant linear combinations of $\sigma^2$ and $\tau^2$. Since 

$$EL_0(a_1, a_2) = a_1E\left(\frac{1}{n}\|y\|^2\right) + a_2E\left\{\frac{1}{n}y^TX(X^TX)^{-1}X^Ty\right\} = a_1(\sigma^2 + \tau^2) + a_2\left(\frac{d}{n}\sigma^2 + \tau^2\right),$$

it follows that there exist $a_{11}, a_{12} \in \mathbb{R}$ such that $L_0(a_{11}, a_{12})$ is an unbiased estimator of $\sigma^2$, i.e. $EL_0(a_{11}, a_{12}) = \sigma^2$. In particular, we have 

$$EL_0\left(\frac{n}{n-d}, -\frac{n}{n-d}\right) = \sigma^2$$

and, moreover, $\hat\sigma_0^2 = L_0\{n/(n-d), -n/(n-d)\}$. Thus, for $d < n$, $\hat\sigma_0^2$ may be viewed as the unique linear combination of $n^{-1}\|y\|^2$ and $n^{-1}y^TX(X^TX)^{-1}X^Ty$ that yields an unbiased estimator of $\sigma^2$. 

The identities (5)-(6) also imply that there exist $a_{21}, a_{22} \in \mathbb{R}$ such that $L_0(a_{21}, a_{22})$ is an unbiased estimator for $\tau^2$. Indeed, 

$$EL_0\left(-\frac{d}{n-d}, \frac{n}{n-d}\right) = \tau^2$$

and 

$$\hat\tau_0^2 = L_0\left(-\frac{d}{n-d}, \frac{n}{n-d}\right)$$

is the estimator defined initially in (3). 

The ideas above are easily adapted to a more general setting that is useful for problems where $d > n$. Broadly, we seek statistics $T_1 = T_1(y, X)$ and $T_2 = T_2(y, X)$ such that 

$$E(T_1) = b_{11}\sigma^2 + b_{12}\tau^2, \quad E(T_2) = b_{21}\sigma^2 + b_{22}\tau^2 \quad \text{for some constants } b_{ij} \in \mathbb{R} \text{ with } b_{11}b_{22} - b_{12}b_{21} \neq 0. \tag{7}$$

In other words, the expected values of the statistics $T_1$, $T_2$ should form a pair of non-degenerate linear combinations of $\sigma^2$ and $\tau^2$. If such $T_1$ and $T_2$ can be found, then unbiased estimators for $\sigma^2$, $\tau^2$ may be formed by taking linear combinations of $T_1$ and $T_2$. Moreover, asymptotic properties of these estimators are determined by the asymptotic properties of $T_1$, $T_2$. 

In the example discussed above, where $d < n$, $T_1 = n^{-1}\|y\|^2$ and $T_2 = n^{-1}y^TX(X^TX)^{-1}X^Ty$. If $d > n$, then alternatives to $T_2 = n^{-1}y^TX(X^TX)^{-1}X^Ty$ must be sought; in this paper, we focus on $T_2 = n^{-2}\|X^Ty\|^2$ (remarks on other potential alternatives may be found in Section 5). Using basic facts about the Wishart distribution (see Supplemental Text for formulas involving various moments of the Wishart distribution, which are obtained using techniques from (Graczyk et al., 2005; Letac and Massam, 2004) and are used throughout the paper), we have 

$$E\left(\frac{1}{n^2}\|X^Ty\|^2\right) = \frac{1}{n^2}Ey^TXX^Ty = \frac{d+n+1}{n}\tau^2 + \frac{d}{n}\sigma^2. \tag{8}$$

Since $E(n^{-1}\|y\|^2) = \sigma^2 + \tau^2$, it follows that $T_1 = n^{-1}\|y\|^2$ and $T_2 = n^{-2}\|X^Ty\|^2$ satisfy (7). Moreover, $T_2 = n^{-2}\|X^Ty\|^2$ is defined and (8) is valid even when $d > n$. Now let 

$$L(a_1, a_2) = a_1\frac{1}{n}\|y\|^2 + a_2\frac{1}{n^2}\|X^Ty\|^2$$

and define 

$$\hat\sigma^2 = L\left(\frac{d+n+1}{n+1}, -\frac{n}{n+1}\right) = \frac{d+n+1}{n(n+1)}\|y\|^2 - \frac{1}{n(n+1)}\|X^Ty\|^2,$$

$$\hat\tau^2 = L\left(-\frac{d}{n+1}, \frac{n}{n+1}\right) = -\frac{d}{n(n+1)}\|y\|^2 + \frac{1}{n(n+1)}\|X^Ty\|^2.$$

Making use of (5) and (8), a basic calculation implies that $\hat\sigma^2$ and $\hat\tau^2$ are unbiased estimators for $\sigma^2$ and $\tau^2$. Thus, we have the following theorem. 

Theorem 1. [Unbiasedness] Suppose that $\Sigma = I$. Then $E(\hat\sigma^2) = \sigma^2$ and $E(\hat\tau^2) = \tau^2$. 
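
The basic estimators only require $\|y\|^2$ and $\|X^Ty\|^2$, so they can be computed without any matrix inversion, including when $d > n$. The following is a minimal sketch under the assumption $\Sigma = I$; the function name `basic_estimators` is hypothetical and the formulas are the linear combinations $\hat\sigma^2 = \frac{d+n+1}{n(n+1)}\|y\|^2 - \frac{1}{n(n+1)}\|X^Ty\|^2$ and $\hat\tau^2 = -\frac{d}{n(n+1)}\|y\|^2 + \frac{1}{n(n+1)}\|X^Ty\|^2$.

```python
import numpy as np

def basic_estimators(y, X):
    """Return (sigma2_hat, tau2_hat) for the Sigma = I case.

    Works for any n and d, including d > n (hypothetical helper).
    """
    n, d = X.shape
    y_sq = float(y @ y)          # ||y||^2
    Xty = X.T @ y
    Xty_sq = float(Xty @ Xty)    # ||X^T y||^2
    scale = n * (n + 1)
    sigma2_hat = (d + n + 1) / scale * y_sq - Xty_sq / scale
    tau2_hat = -d / scale * y_sq + Xty_sq / scale
    return sigma2_hat, tau2_hat
```

Note that the two estimates always sum to $n^{-1}\|y\|^2$, and that either one can be negative in a given sample, since unbiasedness does not rule this out.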
2.2. Consistency 

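Let $\hat\theta = (\hat\sigma^2, \hat\tau^2)^T$ and let $T = (n^{-1}\|y\|^2, n^{-2}\|X^Ty\|^2)^T$. The covariance matrix of $\hat\theta$ is important for understanding the asymptotic properties of $\hat\sigma^2$ and $\hat\tau^2$. Since $\hat\theta = AT$, where 

$$A = \begin{pmatrix} \frac{d+n+1}{n+1} & -\frac{n}{n+1} \\ -\frac{d}{n+1} & \frac{n}{n+1} \end{pmatrix}, \tag{9}$$

it follows that $\mathrm{Cov}(\hat\theta) = A\,\mathrm{Cov}(T)A^T$. The covariance matrices for $\hat\theta$ and $T$ are both computed explicitly in the Appendix. Asymptotic approximations for the entries of $\mathrm{Cov}(\hat\theta)$ that are valid as $d/n \to \rho \in [0, \infty)$ are given below: 

$$\mathrm{Var}(\hat\sigma^2) \sim \frac{2}{n}\left\{\rho(\sigma^2 + \tau^2)^2 + \sigma^4 + \tau^4\right\}, \tag{10}$$

$$\mathrm{Var}(\hat\tau^2) \sim \frac{2}{n}\left\{(1 + \rho)(\sigma^2 + \tau^2)^2 - \sigma^4 + 3\tau^4\right\}, \tag{11}$$

$$\mathrm{Cov}(\hat\sigma^2, \hat\tau^2) \sim -\frac{2}{n}\left\{\rho(\sigma^2 + \tau^2)^2 + 2\tau^4\right\}. \tag{12}$$

The following theorem contains a slightly more detailed version of these approximations, and gives an explicit consistency result for $\hat\sigma^2$, $\hat\tau^2$. The theorem is proved in the Appendix. 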
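Theorem 2. [Consistency] Suppose that $\Sigma = I$. Then 

$$\mathrm{Var}(\hat\sigma^2) = \frac{2}{n}\left\{\frac{d}{n}(\sigma^2 + \tau^2)^2 + \sigma^4 + \tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\},$$

$$\mathrm{Var}(\hat\tau^2) = \frac{2}{n}\left\{\left(1 + \frac{d}{n}\right)(\sigma^2 + \tau^2)^2 - \sigma^4 + 3\tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\},$$

$$\mathrm{Cov}(\hat\sigma^2, \hat\tau^2) = -\frac{2}{n}\left\{\frac{d}{n}(\sigma^2 + \tau^2)^2 + 2\tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}.$$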

In particular, 



n'^ 

Remark 1. If $d/n \to \rho \in [0, \infty)$, then the asymptotic approximations (10)-(12) follow immediately from Theorem 2. 
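
For planning simulations or sample sizes, it can help to evaluate the leading-order covariance approximations numerically. The sketch below encodes the approximations $\mathrm{Var}(\hat\sigma^2) \approx \frac{2}{n}\{\frac{d}{n}(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\}$, $\mathrm{Var}(\hat\tau^2) \approx \frac{2}{n}\{(1+\frac{d}{n})(\sigma^2+\tau^2)^2 - \sigma^4 + 3\tau^4\}$ and $\mathrm{Cov}(\hat\sigma^2, \hat\tau^2) \approx -\frac{2}{n}\{\frac{d}{n}(\sigma^2+\tau^2)^2 + 2\tau^4\}$ for the $\Sigma = I$ case, dropping the $O(1/n)$ relative error; the function name `approx_cov` is hypothetical.

```python
import numpy as np

def approx_cov(sigma2, tau2, d, n):
    """Leading-order 2x2 covariance of (sigma2_hat, tau2_hat), Sigma = I.

    Ignores the relative O(1/n) correction (hypothetical helper).
    """
    r = d / n
    s = (sigma2 + tau2) ** 2
    var_sigma = 2.0 / n * (r * s + sigma2 ** 2 + tau2 ** 2)
    var_tau = 2.0 / n * ((1.0 + r) * s - sigma2 ** 2 + 3.0 * tau2 ** 2)
    cov = -2.0 / n * (r * s + 2.0 * tau2 ** 2)
    return np.array([[var_sigma, cov], [cov, var_tau]])
```

The negative off-diagonal entry reflects the fact that $\hat\sigma^2 + \hat\tau^2 = n^{-1}\|y\|^2$, so overestimating one quantity tends to go with underestimating the other.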

Remark 2. It is instructive to compare the asymptotic variance and covariance of $\hat\sigma^2$, $\hat\tau^2$ to that of the estimators $\hat\sigma_0^2$, $\hat\tau_0^2$, defined in (2)-(3). If $n \to \infty$ and $d/n \to \rho \in [0, 1)$, then 



72(1 - p) 



n{l - p)' 

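Notice that in (10), $\mathrm{Var}(\hat\sigma^2)$ increases with the signal strength $\tau^2$, while $\mathrm{Var}(\hat\sigma_0^2)$ does not depend on $\tau^2$. On the other hand, $\mathrm{Var}(\hat\sigma^2) < \mathrm{Var}(\hat\sigma_0^2)$ when $\tau^2$ is small or $\rho$ is close to 1. 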

Remark 3. Suppose that $c_1, c_2 > 0$ are fixed. Theorem 2 implies that if $d/n \to \rho \in [0, \infty)$, then $\hat\sigma^2$, $\hat\tau^2$ are consistent in the sense that 

$$\lim_{d/n \to \rho}\ \sup_{\substack{0 < \sigma^2 < c_1 \\ 0 < \tau^2 < c_2}} E(\hat\sigma^2 - \sigma^2)^2 = \lim_{d/n \to \rho}\ \sup_{\substack{0 < \sigma^2 < c_1 \\ 0 < \tau^2 < c_2}} E(\hat\tau^2 - \tau^2)^2 = 0. \tag{13}$$

On the other hand, Dicker (2012b) proved that if $\rho > 0$, then it is impossible to estimate $\beta$ in this setting. In particular, if $\rho > 0$, then 

$$\liminf_{d/n \to \rho}\ \inf_{\hat\beta}\ \sup_{\substack{0 < \sigma^2 < c_1 \\ 0 < \tau^2 < c_2}} E\|\hat\beta - \beta\|^2 > 0,$$

where the infimum is over all measurable estimators for $\beta$. Thus, Theorem 2 describes methods for consistently estimating $\sigma^2$ and $\tau^2$ in high-dimensional linear models, where it is impossible to estimate $\beta$. If $\rho \in [0, 1)$, then (13) holds with $\hat\sigma_0^2$, $\hat\tau_0^2$ in place of $\hat\sigma^2$, $\hat\tau^2$. However, Theorem 2 also applies to settings where $d > n$ (i.e. $\rho > 1$) and the estimators $\hat\sigma_0^2$, $\hat\tau_0^2$ are undefined. □ 

2.3. Asymptotic normality 

Define the total variation distance between random variables $u$ and $v$ to be 

$$d_{TV}(u, v) = \sup_{B \in \mathcal{B}(\mathbb{R})} |P(u \in B) - P(v \in B)|,$$

where $\mathcal{B}(\mathbb{R})$ denotes the collection of Borel sets in $\mathbb{R}$. The next theorem is this paper's main result on asymptotic normality. It is a direct application of results in (Chatterjee, 2009). Theorem 3 is proved in the Appendix and it is valid for arbitrary positive definite covariance matrices $\Sigma$. 

Theorem 3. [Asymptotic normality] Let $\lambda_1 = \|n^{-1}X^TX\|$ be the operator norm of $n^{-1}X^TX$ (i.e. $\lambda_1$ is the largest eigenvalue of $n^{-1}X^TX$). Let $h: \mathbb{R}^2 \to \mathbb{R}$ be a function with continuous second order partial derivatives, let $\nabla h$ denote the gradient of $h$, and let $\nabla^2 h$ denote the Hessian of $h$. Suppose that $\psi^2 = \mathrm{Var}\{h(T)\} < \infty$ and let $w$ be a normal random variable with the same mean and variance as $h(T)$. Then 

dTv{hinw} = 0{ \l\^J , (14) 



3/2^2 



where $\xi$ and $\nu$ are defined as follows: 

e = aa^r^IJ,d,n) = 7^ + 72^' + To^M^ + 1) 

u = K^^r^^,rf,n) = vl^' + V'J' + vI^'At' + I) + I'J' + l'o^\r^ + I) 
and, for non-negative integers k, 

Ik = ^kia',T\E,d,n) = e(||VMT)||'(Ai + P^ 
Vk = r,k{a\T\E,d,n) = E <( 1 1 V'^T) I T (Ai + 1 
Remark 1. If $\|\Sigma\|$ is bounded, then the asymptotic behavior of the upper bound (14) is determined by that of $\xi$, $\nu$, and $\psi^2$, which, in turn, is determined by the function $h$. For the functions $h$ considered in this paper, if $d/n \to \rho \in [0, \infty)$, then $\xi$, $\nu$, and $n\psi^2$ are bounded by rational functions in $\sigma^2$ and $\tau^2$. Thus, if $\|\Sigma\|$ is bounded, $d/n \to \rho \in [0, \infty)$, and $\sigma^2$, $\tau^2$ lie in some compact set, then we typically have 

$$d_{TV}\{h(T), w\} = O(n^{-1/2}).$$

In other words, $h(T)$ converges to a normal random variable at rate $n^{-1/2}$. Under these conditions, if $\psi^2 = \mathrm{Var}\{h(T)\}$ is known or estimable (as it is for the $h$ studied here), then asymptotically valid confidence intervals for $Eh(T)$ may be constructed using Theorem 3. □ 

Now let $A$ be the matrix (9) and let $a_1^T$, $a_2^T$ denote the first and second rows of $A$, respectively. Applying Theorem 3 with $\Sigma = I$ and $h(T) = a_1^TT = \hat\sigma^2$, $h(T) = a_2^TT = \hat\tau^2$, and $h(T) = (a_2^TT)/(a_1^TT) = \hat\tau^2/\hat\sigma^2$ gives bounds on the total variation distance between $\hat\sigma^2$, $\hat\tau^2$, and $\hat\tau^2/\hat\sigma^2$ and corresponding normal random variables. These examples are pursued in more detail below. 

Example 1 ($\hat\sigma^2$ and $\hat\tau^2$). Let $h(T) = a_1^TT = \hat\sigma^2$ in Theorem 3 and suppose that $\Sigma = I$. Then $\eta_k = 0$, because $\nabla^2 h = 0$. To bound $\gamma_k$, we have 



= E { \\a,\\\X, + If ( -\\e\\' 



n 




Thus, 



By Theorem 2, 

$$\mathrm{Var}(\hat\sigma^2) = \frac{2}{n}\left\{\frac{d}{n}(\sigma^2 + \tau^2)^2 + \sigma^4 + \tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}.$$

Now let 

$$\psi_1^2 = 2\left\{\frac{d}{n}(\sigma^2 + \tau^2)^2 + \sigma^4 + \tau^4\right\} \tag{15}$$

and let $z \sim N(0, 1)$. Then Theorem 3 implies 



"2 2 

cr^ — 



^1 

Similar calculations imply that 



z\ =0 



n \ n 



a + T 



^2 



z \ = 




where 

$$\psi_2^2 = 2\left\{\left(1 + \frac{d}{n}\right)(\sigma^2 + \tau^2)^2 - \sigma^4 + 3\tau^4\right\}. \tag{16}$$


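Thus, we have the following corollary to Theorem 3. 

Corollary 1. Suppose that $\Sigma = I$ and $D \subseteq (0, \infty)$ is compact. Let $z \sim N(0, 1)$. If $d/n \to \rho \in [0, \infty)$, then 

$$\sup_{\sigma^2, \tau^2 \in D} d_{TV}\left\{\sqrt{n}\left(\frac{\hat\sigma^2 - \sigma^2}{\psi_1}\right), z\right\} = O(n^{-1/2}), \qquad \sup_{\sigma^2, \tau^2 \in D} d_{TV}\left\{\sqrt{n}\left(\frac{\hat\tau^2 - \tau^2}{\psi_2}\right), z\right\} = O(n^{-1/2}),$$

where $\psi_1$, $\psi_2$ are defined in (15)-(16). 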
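
One practical use of this normal approximation is an approximate confidence interval for $\sigma^2$. The sketch below assumes $\Sigma = I$ and uses a plug-in estimate of $\psi_1^2 = 2\{\frac{d}{n}(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\}$ obtained by substituting $\hat\sigma^2$, $\hat\tau^2$ for $\sigma^2$, $\tau^2$; the plug-in step and the function name `sigma2_confidence_interval` are assumptions made here for illustration rather than a procedure stated in the text.

```python
from statistics import NormalDist
import numpy as np

def sigma2_confidence_interval(y, X, level=0.95):
    """Approximate normal-theory interval for sigma^2 when Sigma = I.

    Uses a plug-in estimate of psi_1 (an assumption of this sketch).
    """
    n, d = X.shape
    y_sq = float(y @ y)
    Xty = X.T @ y
    Xty_sq = float(Xty @ Xty)
    scale = n * (n + 1)
    sigma2_hat = (d + n + 1) / scale * y_sq - Xty_sq / scale
    tau2_hat = -d / scale * y_sq + Xty_sq / scale
    # Plug-in version of psi_1^2 from (15).
    psi1_sq = 2.0 * ((d / n) * (sigma2_hat + tau2_hat) ** 2
                     + sigma2_hat ** 2 + tau2_hat ** 2)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * np.sqrt(psi1_sq / n)
    return sigma2_hat - half_width, sigma2_hat + half_width
```

The lower endpoint can fall below zero in small samples; truncating it at zero is a natural adjustment but is not examined here.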

Example 2 (Signal-to-noise ratio). Suppose that $\Sigma = I$. Define the function $g_0: (\mathbb{R} \setminus \{0\}) \times \mathbb{R} \to \mathbb{R}$ by $g_0(u) = g_0(u_1, u_2) = u_2/u_1$ and let $h_0 = g_0 \circ A$ be defined by $h_0(t) = g_0(At)$, where $A$ is the $2 \times 2$ matrix given in (9). Then $h_0(T) = g_0(\hat\sigma^2, \hat\tau^2) = \hat\tau^2/\hat\sigma^2$ is an estimate of the signal-to-noise ratio. However, Theorem 3 cannot be applied directly because $h_0$ is not defined on all of $\mathbb{R}^2$ (if $a_1^Tt = 0$, then $h_0(t)$ is undefined). To remedy this, we assume that $\sigma^2, \tau^2 \in D$, where $D \subseteq (0, \infty)$ is compact and, moreover, that $d/n \to \rho \in [0, \infty)$. Now let $g: \mathbb{R}^2 \to \mathbb{R}$ be a function with continuous second order partial derivatives such that $\sup_{u \in \mathbb{R}^2} \|\nabla g(u)\|$, $\sup_{u \in \mathbb{R}^2} \|\nabla^2 g(u)\| < \infty$ and $g = g_0$ on $D_0 \times D_0$, where $D_0 \subseteq (0, \infty)$ is a compact set containing $D$ in its interior. 

To show that the estimated signal-to-noise ratio is asymptotically normal, we apply Theorem 3 with $h = g \circ A$. Working under the assumption that $\sigma^2, \tau^2 \in D$ and $d/n \to \rho \in [0, \infty)$, it is straightforward to check that $\gamma_k, \eta_k = O(1)$, for $k = 0, 2, 4, 8$; thus, $\xi, \nu = O(1)$. To approximate the variance of $h(T)$, let $\theta = (\sigma^2, \tau^2)^T$ and $\hat\theta = (\hat\sigma^2, \hat\tau^2)^T$. A second order Taylor expansion yields 



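$$h(T) = g(\hat\theta) = g(\theta) + \nabla g(\theta)^T(\hat\theta - \theta) + R\|\hat\theta - \theta\|^2, \tag{17}$$

where $R = O(1)$. Theorem 2 and a straightforward calculation imply that 

$$\mathrm{Var}\{\nabla g(\theta)^T\hat\theta\} = \nabla g(\theta)^T\mathrm{Cov}(\hat\theta)\nabla g(\theta) = \frac{2}{n\sigma^8}\left\{\left(1 + \frac{d}{n}\right)(\sigma^2 + \tau^2)^4 - \sigma^4(\sigma^2 + \tau^2)^2\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}.$$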



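Since $\mathrm{Var}(\|\hat\theta - \theta\|^2) = O(n^{-2})$ and $R = O(1)$, (17) implies 

$$\psi^2 = \mathrm{Var}\{h(T)\} = \frac{2}{n\sigma^8}\left\{\left(1 + \frac{d}{n}\right)(\sigma^2 + \tau^2)^4 - \sigma^4(\sigma^2 + \tau^2)^2\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}.$$

Thus, Theorem 3 implies that 

$$d_{TV}\left[\sqrt{n}\left\{\frac{h(T) - Eh(T)}{\psi_0}\right\}, z\right] = O(n^{-1/2}), \tag{18}$$

where $z \sim N(0, 1)$ and 

$$\psi_0^2 = \frac{2}{\sigma^8}\left\{\left(1 + \frac{d}{n}\right)(\sigma^2 + \tau^2)^4 - \sigma^4(\sigma^2 + \tau^2)^2\right\}. \tag{19}$$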



Finally, in order to relate (18) directly to $h_0(T) = \hat\tau^2/\hat\sigma^2$ and the signal-to-noise ratio $\tau^2/\sigma^2$, notice that Theorem 2 implies 



P hiT) ^ 



O 



and equation (17) implies 



/ 1 

EMT) = - + - 



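Combining these facts with (18), we obtain the following result. 

Corollary 2. Suppose that $\Sigma = I$ and $D \subseteq (0, \infty)$ is compact. Let $z \sim N(0, 1)$. If $d/n \to \rho \in [0, \infty)$, then 

$$\sup_{\sigma^2, \tau^2 \in D} d_{TV}\left\{\sqrt{n}\left(\frac{\hat\tau^2/\hat\sigma^2 - \tau^2/\sigma^2}{\psi_0}\right), z\right\} = O(n^{-1/2}),$$

where $\psi_0$ is defined in (19). 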
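
The estimated signal-to-noise ratio and a rough standard error can be computed from the same two summary statistics. The sketch below assumes $\Sigma = I$ and takes the delta-method variance factor to be $\psi_0^2 = \frac{2}{\sigma^8}\{(1+\frac{d}{n})(\sigma^2+\tau^2)^4 - \sigma^4(\sigma^2+\tau^2)^2\}$, evaluated at the estimates; both the plug-in step and the name `snr_estimate` are assumptions of this illustration.

```python
import numpy as np

def snr_estimate(y, X):
    """Return (snr_hat, std_err) for tau^2 / sigma^2 when Sigma = I.

    std_err uses a plug-in delta-method variance (assumption of this sketch).
    """
    n, d = X.shape
    y_sq = float(y @ y)
    Xty = X.T @ y
    Xty_sq = float(Xty @ Xty)
    scale = n * (n + 1)
    sigma2_hat = (d + n + 1) / scale * y_sq - Xty_sq / scale
    tau2_hat = -d / scale * y_sq + Xty_sq / scale
    if sigma2_hat <= 0:
        raise ValueError("sigma2_hat is not positive; the ratio is unusable")
    snr_hat = tau2_hat / sigma2_hat
    s = sigma2_hat + tau2_hat
    psi0_sq = 2.0 / sigma2_hat ** 4 * ((1.0 + d / n) * s ** 4
                                       - sigma2_hat ** 2 * s ** 2)
    # Guard against a negative plug-in value, possible when tau2_hat < 0.
    std_err = np.sqrt(max(psi0_sq, 0.0) / n)
    return snr_hat, std_err
```

The explicit check on `sigma2_hat` plays the role of the smooth modification $g$ of $g_0$ in the example: the ratio is only meaningful when the denominator stays away from zero.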
3. Unknown $\Sigma$ 

In this section, we propose estimators for $\sigma^2$, $\tau^2$ for use when $\Sigma$ is an unknown $d \times d$ positive definite matrix. In Section 3.1, we consider the case where a norm-consistent estimator for $\Sigma$ is available. In this setting, consistent (and, under certain conditions, asymptotically normal) estimators for $\sigma^2$, $\tau^2$ are obtained by essentially transforming the problem to the $\Sigma = I$ case. In Section 3.2, we consider the case where a norm-consistent estimator for $\Sigma$ is not available. Here we derive alternative estimators for $\sigma^2$, $\tau^2$ and these estimators are shown to be consistent and asymptotically normal under additional conditions on $\Sigma$ and $\beta$. 

3.1. Estimable $\Sigma$ 

An estimator $\hat\Sigma$ for $\Sigma$ is norm-consistent if $\|\hat\Sigma - \Sigma\| \to 0$, where $\|\hat\Sigma - \Sigma\|$ is the operator norm of $\hat\Sigma - \Sigma$ and the convergence holds in some appropriate sense (e.g. convergence in probability or squared-mean). In high-dimensional data analysis where $d/n \to \rho > 0$, the sample covariance matrix $n^{-1}X^TX$ is not a norm-consistent estimator for $\Sigma$; furthermore, in the absence of additional information about $\Sigma$, it is generally not possible to find a norm-consistent estimator for $\Sigma$. However, Bickel and Levina (2008), El Karoui (2008a), Cai et al. (2010), and others have shown that for wide classes of matrices $\Sigma$, norm-consistent estimators are available when $d/n \to \rho > 0$. Moreover, one can reasonably envision situations in practice where pertinent prior information about the population predictor covariance matrix $\Sigma$ is available (so that a reliable estimator of $\Sigma$ may be found), but there is little prior information about $\beta$ (so that $\beta$ is not estimable and estimates of $\sigma^2$, $\tau^2$ based on residual sums of squares $\|y - X\hat\beta\|^2$ are suspect). Li and Zhang (2010) discuss relevant examples from genomics and fMRI with highly structured high-dimensional predictors, though they focus on variable selection problems. 

Suppose that $\hat\Sigma$ is a positive definite estimator for $\Sigma$ and define the estimators 

$$\hat\sigma^2(\hat\Sigma) = \frac{d+n+1}{n(n+1)}\|y\|^2 - \frac{1}{n(n+1)}\|(X\hat\Sigma^{-1/2})^Ty\|^2,$$

$$\hat\tau^2(\hat\Sigma) = -\frac{d}{n(n+1)}\|y\|^2 + \frac{1}{n(n+1)}\|(X\hat\Sigma^{-1/2})^Ty\|^2.$$

Notice that $\hat\sigma^2 = \hat\sigma^2(I)$ and $\hat\tau^2 = \hat\tau^2(I)$. Now let $Z = (z_1, \ldots, z_n)^T = X\Sigma^{-1/2}$. Then $z_1, \ldots, z_n \sim N(0, I)$ and all of the results from Section 2 apply to the estimators $\hat\sigma^2(\Sigma)$, $\hat\tau^2(\Sigma)$, with $Z$, $\Sigma^{1/2}\beta$ in place of $X$, $\beta$, respectively. Since 
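$$\hat\sigma^2(\hat\Sigma) = \hat\sigma^2(\Sigma) - \frac{1}{n(n+1)}\left\{\|(X\hat\Sigma^{-1/2})^Ty\|^2 - \|Z^Ty\|^2\right\} = \hat\sigma^2(\Sigma) + O\left\{\frac{1}{n^2}\|Z^Ty\|^2\,\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2} - I\|\right\} \tag{20}$$

and 

$$\hat\tau^2(\hat\Sigma) = \hat\tau^2(\Sigma) + \frac{1}{n(n+1)}\left\{\|(X\hat\Sigma^{-1/2})^Ty\|^2 - \|Z^Ty\|^2\right\} = \hat\tau^2(\Sigma) + O\left\{\frac{1}{n^2}\|Z^Ty\|^2\,\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2} - I\|\right\}, \tag{21}$$

we conclude that if $\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2} - I\|$ is small, then asymptotic properties of $\hat\sigma^2(\hat\Sigma)$ and $\hat\tau^2(\hat\Sigma)$ are determined by those of $\hat\sigma^2(\Sigma)$ and $\hat\tau^2(\Sigma)$. This is illustrated in the following proposition, which is a direct consequence of (20)-(21) and the results of Section 2. 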

Proposition 1. Let $\hat\Sigma$ be a positive definite estimator for $\Sigma$. Suppose further that 
\\E~^\\=Op{l). 

(i) [Consistency] 

\a\E)-a\ \t\E)-t^\-- 

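(ii) [Asymptotic normality] Let $\psi_1$, $\psi_2$, and $\psi_0$ be as defined in (15), (16), and (19). Suppose that $d/n \to \rho \in [0, \infty)$ and that $\sigma^2, \tau^2 \in D$ for some compact set $D \subseteq (0, \infty)$. If $\|\hat\Sigma - \Sigma\| = o_P(n^{-1/2})$, then 

$$\sqrt{n}\left\{\frac{\hat\sigma^2(\hat\Sigma) - \sigma^2}{\psi_1}\right\},\ \sqrt{n}\left\{\frac{\hat\tau^2(\hat\Sigma) - \tau^2}{\psi_2}\right\},\ \sqrt{n}\left\{\frac{\hat\tau^2(\hat\Sigma)/\hat\sigma^2(\hat\Sigma) - \tau^2/\sigma^2}{\psi_0}\right\} \rightsquigarrow N(0, 1),$$

where $\rightsquigarrow$ indicates convergence in distribution. 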

Remark 1. Part (i) of Proposition 1 implies if cr^, are bounded, d = o(n^), and HX" — i^H = 
op(l), then o"^(X') and t^(X') are weakly consistent for and r^, respectively. 

Remark 2. If $\|\hat\Sigma - \Sigma\| = o_P(n^{-1/2})$ and the other conditions of Proposition 1 are met, then $\hat\sigma^2(\hat\Sigma)$, $\hat\tau^2(\hat\Sigma)$, and $\hat\tau^2(\hat\Sigma)/\hat\sigma^2(\hat\Sigma)$ are asymptotically normal with the same asymptotic variance as $\hat\sigma^2(\Sigma)$, $\hat\tau^2(\Sigma)$, and $\hat\tau^2(\Sigma)/\hat\sigma^2(\Sigma)$, respectively. The condition $\|\hat\Sigma - \Sigma\| = o_P(n^{-1/2})$ is quite strong. However, Bickel and Levina (2008) and Cai et al. (2010) describe broad classes of covariance matrices $\Sigma$ that can be estimated at this rate. For concreteness, we note that if the entries of $x_i$ follow one of many common time series models (e.g. AR($k$) for fixed $k$), then there exist estimators $\hat\Sigma$ such that $\|\hat\Sigma - \Sigma\| = o_P(n^{-1/2})$ when $d/n \to \rho \in (0, \infty)$. □ 
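
When a covariance estimate is available, the estimators of this subsection differ from the $\Sigma = I$ versions only through the quadratic form $\|\hat\Sigma^{-1/2}X^Ty\|^2 = (X^Ty)^T\hat\Sigma^{-1}(X^Ty)$. The sketch below computes that quadratic form with a linear solve instead of an explicit matrix square root; the function name `estimators_given_cov` is hypothetical, and how `Sigma_hat` is obtained is left entirely to the user.

```python
import numpy as np

def estimators_given_cov(y, X, Sigma_hat):
    """Return (sigma2_hat, tau2_hat) using a positive definite Sigma_hat.

    Sigma_hat = I recovers the basic estimators (hypothetical helper).
    """
    n, d = X.shape
    y_sq = float(y @ y)
    Xty = X.T @ y
    # ||Sigma_hat^{-1/2} X^T y||^2 = (X^T y)^T Sigma_hat^{-1} (X^T y)
    quad = float(Xty @ np.linalg.solve(Sigma_hat, Xty))
    scale = n * (n + 1)
    sigma2_hat = (d + n + 1) / scale * y_sq - quad / scale
    tau2_hat = -d / scale * y_sq + quad / scale
    return sigma2_hat, tau2_hat
```

Using `np.linalg.solve` keeps the cost at one $d \times d$ solve and avoids forming $\hat\Sigma^{-1/2}$, which is never needed on its own.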
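3.2. Non-estimable $\Sigma$ 

Define $\tau_k^2 = \beta^T\Sigma^k\beta$ and $m_k = d^{-1}\mathrm{tr}(\Sigma^k)$, $k = 0, 1, 2, \ldots$. Then $\tau^2 = \tau_1^2$. For general positive definite matrices $\Sigma$, one easily checks that 

$$\frac{1}{n}E\|y\|^2 = \sigma^2 + \tau_1^2, \tag{22}$$

$$\frac{1}{n^2}E\|X^Ty\|^2 = \frac{d}{n}m_1\sigma^2 + \frac{d}{n}m_1\tau_1^2 + \left(1 + \frac{1}{n}\right)\tau_2^2, \tag{23}$$

and 

$$E(\hat\sigma^2) = \frac{d(1 - m_1) + n + 1}{n+1}\sigma^2 + \frac{d(1 - m_1) + n + 1}{n+1}\tau_1^2 - \tau_2^2,$$

$$E(\hat\tau^2) = \frac{d(m_1 - 1)}{n+1}\sigma^2 + \frac{d(m_1 - 1)}{n+1}\tau_1^2 + \tau_2^2.$$

Thus, if $\Sigma \neq I$, then $\hat\sigma^2$, $\hat\tau^2$ are typically not unbiased estimators for $\sigma^2$, $\tau^2$, respectively. 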



More generally, it follows that if $\Sigma \neq I$, then the expected value of the linear combination $L(a_1, a_2) = a_1n^{-1}\|y\|^2 + a_2n^{-2}\|X^Ty\|^2$ typically depends on $\sigma^2$, $\tau_1^2$, $\tau_2^2$, and $\mathrm{tr}(\Sigma)$ (in addition to $a_1$, $a_2$, $d$, and $n$). By contrast, as seen in Section 2, if $\Sigma = I$ then $\tau^2 = \tau_1^2 = \tau_2^2$ and $EL(a_1, a_2)$ is determined by $\sigma^2$ and $\tau^2$; indeed, in the $\Sigma = I$ case, this fact is precisely what is leveraged to obtain unbiased estimators for $\sigma^2$, $\tau^2$. This suggests that an alternative method for estimating $\sigma^2$, $\tau^2$ may be necessary when $\Sigma$ is unknown and non-estimable. 

In this section, we do not completely abandon our strategy of estimating $\sigma^2$, $\tau^2$ by using linear combinations of $n^{-1}\|y\|^2$ and $n^{-2}\|X^Ty\|^2$. Rather, we propose modified versions of $\hat\sigma^2$ and $\hat\tau^2$ that are consistent and asymptotically normal, provided $\beta$ and $\Sigma$ satisfy certain conditions that have appeared previously in the random matrix theory literature. These conditions are stated below. 

(A) As $d \to \infty$, the empirical distribution of the eigenvalues of $\Sigma$ converges weakly to a probability distribution with support contained in a compact subset of $(0, \infty)$ and cumulative distribution function $H$. 

(B) Let 

Mk = j dH{x) and ^ 



where the distribution $H$ is given in condition (A). Then, as $d \to \infty$, 

$$\Delta_k \to 0, \quad k = 1, 2, 3. \tag{24}$$
Condition (A) is fairly standard and is frequently assumed to hold in asymptotic analyses in random matrix theory (Bai et al., 2007; Bai and Silverstein, 2004; El Karoui, 2008b; Marcenko and Pastur, 1967). The compact support requirement in condition (A) can likely be relaxed; however, this is not pursued further here. Condition (B) is more specialized and requires that the parameter $\beta$ interacts with $\Sigma$ as determined by (24). In fact, while condition (B) is sufficient for our consistency results in this section, we require a stronger version of condition (B) (stated precisely in Proposition 2 (ii)) to obtain asymptotic normality. Bai et al. (2007) and Pan and Zhou (2008) have proposed conditions that are closely related to (B) and the strengthened version of (B) appearing in Proposition 2 (ii) (in fact, their conditions are stronger, if $H$ has finite moments). Bai et al. (2007) have noted that under condition (A), if $\Sigma$ is an independent, orthogonally invariant random matrix (e.g. if $\Sigma$ is a Wishart matrix and $E(\Sigma) = cI$, for some constant $c > 0$), then condition (B) holds for any $\beta$. Furthermore, Bai et al. (2007) point out that for any $\Sigma$ there must exist some $\beta$ such that condition (B) holds; for instance, take $\beta = u$, where $u = d^{-1/2}(u_1 + \cdots + u_d)$ and $u_1, \ldots, u_d$ are orthonormal eigenvectors of $\Sigma$. More broadly, (B) may be interpreted as requiring that $\beta$ and $\Sigma$ are asymptotically free. 

Presently, we provide a heuristic to motivate estimators for $\sigma^2$ and $\tau^2$ under conditions (A)
and (B). Following the method of moments, the identities

\[ \frac{1}{d} E\,\mathrm{tr}\left(\frac{1}{n}X^TX\right) = m_1 \quad\text{and}\quad \frac{1}{d} E\,\mathrm{tr}\left\{\left(\frac{1}{n}X^TX\right)^2\right\} = \frac{d}{n}m_1^2 + \left(1 + \frac{1}{n}\right)m_2 \]

suggest that

\[ \hat m_1 = \frac{1}{d}\mathrm{tr}\left(\frac{1}{n}X^TX\right) \quad\text{and}\quad \hat m_2 = \frac{n}{d(n+1)}\mathrm{tr}\left\{\left(\frac{1}{n}X^TX\right)^2\right\} - \frac{1}{d(n+1)}\left\{\mathrm{tr}\left(\frac{1}{n}X^TX\right)\right\}^2 \]

are reasonable estimators for $m_1$ and $m_2$, respectively. Now assume that $d, n$ are large and
$d/n \to \rho \in [0, \infty)$. Then, for $k = 1, 2$, condition (A) implies that $\hat m_k \approx m_k \approx M_k$ and (B)
implies $\tau_k^2 \approx \tau^2 \hat m_k/\hat m_1$. Combining these approximations with equations (22)-(23) yields

\[ \frac{1}{n}E\|y\|^2 = \sigma^2 + \tau^2, \tag{25} \]

\[ \frac{1}{n^2}E\|X^Ty\|^2 \approx \frac{d}{n}\hat m_1\sigma^2 + \left\{\frac{d}{n}\hat m_1 + \left(1 + \frac{1}{n}\right)\frac{\hat m_2}{\hat m_1}\right\}\tau^2. \tag{26} \]

Observe that the right-hand side of (25)-(26) consists of linear combinations of $\sigma^2$ and $\tau^2$,
with coefficients determined by the known quantities $d$, $n$, $\hat m_1$, and $\hat m_2$. Thus, we are able
to obtain nearly unbiased estimators of $\sigma^2$ and $\tau^2$ by taking linear combinations of $n^{-1}\|y\|^2$
and $n^{-2}\|X^Ty\|^2$, with coefficients determined by $d$, $n$, $\hat m_1$, and $\hat m_2$. In particular, define the
estimators

\[ \tilde\sigma^2 = \left\{1 + \frac{d\hat m_1^2}{(n+1)\hat m_2}\right\}\frac{1}{n}\|y\|^2 - \frac{\hat m_1}{n(n+1)\hat m_2}\|X^Ty\|^2, \]

\[ \tilde\tau^2 = -\frac{d\hat m_1^2}{n(n+1)\hat m_2}\|y\|^2 + \frac{\hat m_1}{n(n+1)\hat m_2}\|X^Ty\|^2. \]

A basic calculation using (25)-(26) suggests that $E(\tilde\sigma^2) \approx \sigma^2$ and $E(\tilde\tau^2) \approx \tau^2$.
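
The method-of-moments construction above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `unknown_cov_estimators`, the list-of-rows data layout, and the small simulated dataset ($n = 120$, $d = 60$, $\Sigma = I$) are hypothetical choices and are not part of the numerical work reported in Section 4.

```python
import random


def unknown_cov_estimators(X, y):
    """Sketch of the unknown-covariance estimators of sigma^2 and tau^2.

    X: list of n rows, each a list of d floats; y: list of n floats.
    Returns (sigma2_tilde, tau2_tilde, m1_hat, m2_hat).
    """
    n, d = len(X), len(X[0])

    # tr(X^T X) = tr(X X^T), and tr{(X^T X)^2} equals the sum of the squared
    # entries of the n x n Gram matrix X X^T, so X^T X is never formed.
    tr1 = 0.0
    tr2 = 0.0
    for i in range(n):
        for k in range(i, n):
            g = sum(a * b for a, b in zip(X[i], X[k]))
            if i == k:
                tr1 += g
                tr2 += g * g
            else:
                tr2 += 2.0 * g * g  # entries (i, k) and (k, i)

    # Moment estimators, written in terms of S = X^T X / n.
    tr_s = tr1 / n
    tr_s2 = tr2 / n ** 2
    m1 = tr_s / d
    m2 = n * tr_s2 / (d * (n + 1)) - tr_s ** 2 / (d * (n + 1))

    norm_y2 = sum(v * v for v in y)
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(d)]
    norm_xty2 = sum(v * v for v in xty)

    c = d * m1 ** 2 / ((n + 1) * m2)
    w = m1 / (n * (n + 1) * m2)
    sigma2 = (1.0 + c) * norm_y2 / n - w * norm_xty2
    tau2 = -c * norm_y2 / n + w * norm_xty2
    return sigma2, tau2, m1, m2


# Hypothetical toy data: Sigma = I and sigma^2 = tau^2 = 1.
rng = random.Random(0)
n, d = 120, 60
beta = [1.0 / d ** 0.5] * d  # ||beta||^2 = 1
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 1.0) for row in X]
sigma2_tilde, tau2_tilde, m1_hat, m2_hat = unknown_cov_estimators(X, y)
print(sigma2_tilde, tau2_tilde, m1_hat, m2_hat)
```

By construction the two estimates always sum to $n^{-1}\|y\|^2$, which gives a quick sanity check on any implementation.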

Proposition 2 summarizes some asymptotic properties of $\tilde\sigma^2$ and $\tilde\tau^2$. An outline of the proof,
which is fairly straightforward, may be found in the Appendix.

Proposition 2. Suppose that condition (A) holds, that $D \subseteq (0, \infty)$ is a compact set, and
that $\sigma^2, \tau^2 \in D$. Suppose further that there exist constants $c_1, c_2$ in $\mathbb{R}$ such that either
$0 < c_1 < d/n < c_2 < 1$ or $1 < c_1 < d/n < c_2 < \infty$, and suppose that $|n - d| > 9$. Define
$\bar\Delta_k = \Delta_1 + |m_1 - M_1| + \cdots + \Delta_k + |m_k - M_k|$, where $M_j$ and $\Delta_j$ are defined in condition (B),
and $m_j = d^{-1}\mathrm{tr}(\Sigma^j)$.

(i) [Consistency] 

E{d^ - ay, E{f^ - = O (^^^ + 

Thus, if condition (B) holds, then $|\tilde\sigma^2 - \sigma^2|, |\tilde\tau^2 - \tau^2| \to 0$ in mean-square.

(ii) [Asymptotic normality] Suppose that condition (B) holds, with the additional requirement
that $\bar\Delta_2 = o(n^{-1/2})$, and let



V 777712 mi 



[ mim-i\ 4 77ii77i3 4] 

V m^2 J "^2 J 



I = 2|f^ + ^Va^ + rY-^a^+r2 + ^V' 



2 



777n \ 777 



^0 



[ \ 777772 7772 / 

/ dmj 777i7773\ , , 2x4 "-^l^-^S „4 ^ ^2 , 2x2 

+ — + ^ ) 2~^ + ^ ) 

\ 777772 ^^2 / "^2 



2
Then




iV(0,l). 



Remark 1. The conditions in Proposition 2 that require $|n - d| > 9$ and $d/n$ to be bounded
away from 1 are related to the fact that $\hat m_2^{-1}$ appears in both $\tilde\sigma^2$ and $\tilde\tau^2$. In particular, the
mean-squared error of $\tilde\sigma^2$ and $\tilde\tau^2$ may be infinite if $|n - d|$ is not large enough.

Remark 2. The condition $\bar\Delta_2 = o(n^{-1/2})$ in part (ii) of Proposition 2 is quite strong. For
instance, if $\Sigma$ is a sample covariance matrix formed from iid $N(0, \sigma_0^2)$ data with a constant
aspect ratio, then condition (B) is satisfied, but $\bar\Delta_2 \ne o(n^{-1/2})$. On the other hand, if $\Sigma$
is a constant multiple of the identity matrix, then $\bar\Delta_2 = o(n^{-1/2})$. We emphasize that only
conditions (A) and (B) are required for $\tilde\sigma^2$ and $\tilde\tau^2$ to be consistent; $\bar\Delta_2 = o(n^{-1/2})$ is required
for asymptotic normality.

Remark 3. If $\Sigma = I$, then $m_1 = m_2 = m_3 = 1$ and $\tilde\psi_j^2 = \psi_j^2$, $j = 0, 1, 2$, where $\psi_j^2$ are given
in (15)-(16) and (19). In other words, if $\Sigma = I$, then the asymptotic variance of $\tilde\sigma^2$, $\tilde\tau^2$, and
$\tilde\tau^2/\tilde\sigma^2$ is the same as that of $\hat\sigma^2$, $\hat\tau^2$, and $\hat\tau^2/\hat\sigma^2$, respectively. This is driven by the fact that if
$d/n \to \rho \in (0, \infty)$, then $|\hat m_k - m_k|$ converges at rate . □

4. Numerical results

In this section, we study the performance of the proposed estimators for $\sigma^2$, $\tau^2$, and the
signal-to-noise ratio $\tau^2/\sigma^2$ via simulation. We consider three examples. In the first example,
we report the results of a simulation study that illustrates the performance of the estimators
from Section 2 (for $\Sigma = I$) and Section 3.2 (unknown, non-estimable $\Sigma$); the predictors
$x_i$ are generated from various distributions (including non-normal distributions) that are
described below. In the second example, we compare the performance of $\hat\sigma^2 = \hat\sigma^2(I)$ to that
of $\hat\sigma_0^2 = (n - d)^{-1}\|y - X\hat\beta_{ols}\|^2$ in settings where $d < n$. In the final example, we compare
the performance of estimators proposed in this paper to that of the scaled lasso and MC+
estimators for $\sigma^2$. These estimators for $\sigma^2$ were proposed by Sun and Zhang (2011) for settings
where $\beta$ is sparse; in our simulation study, we consider cases where $\beta$ is sparse and non-sparse.

4.1. Example 1

In this example, $d = 1000$ and the predictors $x_i \in \mathbb{R}^{1000}$ were generated according to one of
three distributions. In the first setting, $x_i \sim N(0, I)$. In the second setting, we generated a
$(2d) \times d$ ($2000 \times 1000$) random matrix $Z$ with iid $N(0, 1)$ entries and took $\Sigma = (2d)^{-1}Z^TZ$;
the iid predictors $x_i$ were then generated according to a $N(0, \Sigma)$ distribution (the same matrix
$\Sigma$ was used for all datasets generated under this setting). In the third setting, the individual
predictors $x_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, d$, were iid random variables taking values in $\{\pm 1\}$ with
$P(x_{ij} = 1) = P(x_{ij} = -1) = 0.5$.

To generate the parameter $\beta \in \mathbb{R}^{1000}$, we created a 1000-dimensional vector with the first
$d/2 = 500$ coordinates iid uniform(0, 1) and the remaining $d/2 = 500$ coordinates iid $N(0, 1)$;
$\beta$ was obtained by standardizing this vector so that $\|\beta\|^2 = \tau^2 = 1$ (the same $\beta$ was used
for all simulated datasets in this example). The residual variance was fixed at $\sigma^2 = 1$ and we
considered datasets with $n = 500$ and $n = 1000$ observations.
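
The three predictor settings and the construction of $\beta$ can be sketched as follows. The function names and the toy sizes are hypothetical. For the second setting the sketch draws $x_i = (2d)^{-1/2}Z^Tw_i$ with $w_i \sim N(0, I_{2d})$, which has covariance $(2d)^{-1}Z^TZ$; this is one convenient sampling device and is not necessarily how the original simulations were run.

```python
import random

rng = random.Random(1)


def predictors_identity(n, d):
    """Setting 1: rows x_i ~ N(0, I)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def predictors_wishart_cov(n, Z):
    """Setting 2: rows x_i ~ N(0, Sigma) with Sigma = Z^T Z / (2d).

    Uses x_i = (2d)^{-1/2} Z^T w_i with w_i ~ N(0, I_{2d}), which has the
    required covariance without factorizing Sigma.
    """
    rows_z, d = len(Z), len(Z[0])  # rows_z = 2d
    out = []
    for _ in range(n):
        w = [rng.gauss(0.0, 1.0) for _ in range(rows_z)]
        out.append([sum(Z[k][j] * w[k] for k in range(rows_z)) / rows_z ** 0.5
                    for j in range(d)])
    return out


def predictors_binary(n, d):
    """Setting 3: iid entries equal to +1 or -1 with probability 0.5 each."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]


def make_beta(d):
    """First d/2 coordinates uniform(0, 1), the rest N(0, 1); scaled so ||beta||^2 = 1."""
    half = d // 2
    raw = [rng.random() for _ in range(half)]
    raw += [rng.gauss(0.0, 1.0) for _ in range(d - half)]
    norm = sum(v * v for v in raw) ** 0.5
    return [v / norm for v in raw]


# Toy sizes; the study itself uses d = 1000 and n = 500 or 1000.
n, d = 8, 6
Z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(2 * d)]  # fixed once
beta = make_beta(d)
x_identity = predictors_identity(n, d)
x_wishart = predictors_wishart_cov(n, Z)
x_binary = predictors_binary(n, d)
```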

For each setting in this example, we generated 500 independent datasets and computed the
estimators $\hat\sigma^2 = \hat\sigma^2(I)$, $\hat\tau^2 = \hat\tau^2(I)$, $\hat\tau^2/\hat\sigma^2 = \hat\tau^2(I)/\hat\sigma^2(I)$ and $\tilde\sigma^2$, $\tilde\tau^2$, $\tilde\tau^2/\tilde\sigma^2$ (the estimators
proposed in Section 2 and Section 3.2, respectively) for each dataset. Recall that the estimators
from Section 2 were derived under the assumption that $x_i \sim N(0, I)$ and the estimators from
Section 3.2 were derived under the assumption that $x_i \sim N(0, \Sigma)$, where $\Sigma$ satisfies conditions
(A)-(B). Summary statistics for the various estimators are reported in Table 1.









                                          $x_i \sim N(0, I)$         $x_i \sim N(0, \Sigma)$    $x_{ij} \in \{\pm 1\}$ binary
Estimator                           n     Mean    Std. Error         Mean    Std. Error       Mean    Std. Error
$\hat\sigma^2(I)$                  500    1.0118  0.1999 (0.2000)    0.5552  0.2839           1.0079  0.1976
                                  1000    1.0003  0.1092 (0.1095)    0.5428  0.1576           1.0035  0.1076
$\tilde\sigma^2$                   500    1.0120  0.2005 (0.2000)    1.0283  0.1832           1.0039  0.1984
                                  1000    1.0003  0.1096 (0.1095)    1.0237  0.1017           1.0014  0.1077
$\hat\tau^2(I)$                    500    0.9847  0.2364 (0.2366)    1.4182  0.3396           0.9937  0.2442
                                  1000    0.9986  0.1408 (0.1414)    1.4408  0.2007           1.0015  0.1402
$\tilde\tau^2$                     500    0.9846  0.2366 (0.2366)    0.9450  0.2261           0.9977  0.2452
                                  1000    0.9986  0.1410 (0.1414)    0.9600  0.1335           1.0036  0.1403
$\hat\tau^2(I)/\hat\sigma^2(I)$    500    1.0687  0.5329 (0.4195)    1.5685  22.6089          1.0801  0.5262
                                  1000    1.0234  0.2531 (0.2366)    3.0488  2.5593           1.0212  0.2415
$\tilde\tau^2/\tilde\sigma^2$      500    1.0694  0.5371 (0.4195)    0.9881  0.4315           1.0901  0.5343
                                  1000    1.0236  0.2538 (0.2366)    0.9573  0.2209           1.0256  0.2426

Table 1

Summary statistics for Example 1 ($d = 1000$). Means and standard errors of various estimators, computed
over 500 independent datasets for each configuration. In each setting, $\sigma^2 = \tau^2 = \tau^2/\sigma^2 = 1$; thus, unbiased
estimators should have mean close to 1. In the standard error column corresponding to $x_i \sim N(0, I)$,
numbers in parentheses are theoretically predicted standard errors (denoted $\psi_1$, $\psi_2$, and $\psi_0$ in the text; see
Corollaries 1-2 and Proposition 2). Theoretically predicted standard errors for $x_i \sim N(0, \Sigma)$ and $x_{ij} \in \{\pm 1\}$
binary are not known; more details may be found in the discussion in Section 4.1.



One of the more striking aspects of the results reported in Table 1 is the consistency and
robustness of the estimators $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$. Proposition 2 suggests that these estimators
might be expected to perform well when $x_i \sim N(0, I)$ and $x_i \sim N(0, \Sigma)$; none of our theoretical
results apply to the case where $x_{ij} \in \{\pm 1\}$ is binary. In the settings where $\mathrm{Cov}(x_i) = I$ and
$x_{ij} \in \{\pm 1\}$ is binary, the performance of $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$ is nearly indistinguishable from that
of $\hat\sigma^2(I)$, $\hat\tau^2(I)$, and $\hat\tau^2(I)/\hat\sigma^2(I)$. On the other hand, when $\mathrm{Cov}(x_i) = \Sigma = (2d)^{-1}Z^TZ$, the
estimators $\hat\sigma^2(I)$, $\hat\tau^2(I)$, and $\hat\tau^2(I)/\hat\sigma^2(I)$ break down significantly (their mean is far from the
actual value $\sigma^2 = \tau^2 = \tau^2/\sigma^2 = 1$), while $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$ still perform effectively. The
estimators $\hat\sigma^2(I)$, $\hat\tau^2(I)$, and $\hat\tau^2(I)/\hat\sigma^2(I)$ were developed under the assumption that $\mathrm{Cov}(x_i) = I$.
Thus, their diminished performance when $\mathrm{Cov}(x_i) \ne I$ is not unexpected. The dramatically
high standard error 22.6089 for $\hat\tau^2(I)/\hat\sigma^2(I)$, when $x_i \sim N(0, \Sigma)$ and $n = 500$, is indicative of
instability when $\hat\sigma^2(I)$ is very small; it also serves as a prompt to point out that our estimators for
$\sigma^2$ and $\tau^2$ can take both positive and negative values. Since $\sigma^2, \tau^2 > 0$, negative values for the
estimators may be undesirable. In practice, one might choose to implement special procedures
for handling negative estimates of these quantities; however, we take no such steps here. In
this example, the only negative estimates of $\sigma^2$ and $\tau^2$ occurred for $\hat\sigma^2(I)$ when $x_i \sim N(0, \Sigma)$:
for $n = 500$, there were 18 datasets (out of 500) where $\hat\sigma^2(I) < 0$; for $n = 1000$, there was one
dataset where $\hat\sigma^2(I) < 0$.
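
One simple way to handle negative estimates, offered here only as a hypothetical post-processing step that is not used anywhere in this paper, is to truncate at zero and to leave the signal-to-noise ratio undefined when the variance estimate is not positive. The function names below are illustrative.

```python
def truncate_at_zero(estimate):
    """Replace a negative variance or signal-strength estimate by 0."""
    return max(estimate, 0.0)


def snr_or_none(sigma2_est, tau2_est):
    """Ratio of estimates after truncation; None when the variance estimate is not positive."""
    if sigma2_est <= 0.0:
        return None
    return truncate_at_zero(tau2_est) / sigma2_est


# Hypothetical raw estimates, including one negative variance estimate.
raw_pairs = [(1.02, 0.97), (-0.05, 1.40), (0.80, -0.10)]
ratios = [snr_or_none(s, t) for s, t in raw_pairs]
print(ratios)
```

Truncation removes impossible values but introduces some bias near zero, so it is a trade-off rather than a fix.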

For $x_i \sim N(0, I)$, Table 1 indicates that the empirical standard errors of the estimators for
$\sigma^2$ and $\tau^2$ are extremely close to the values predicted by Corollary 1 and Proposition 2 (ii)
(denoted $\psi_1$ and $\psi_2$, respectively; these values are displayed in parentheses in Table 1). For the
estimators of the signal-to-noise ratio $\tau^2/\sigma^2$, the agreement between the empirical standard
errors and the theoretically predicted standard error $\psi_0$ (see Corollary 2 and Proposition 2
(ii)) is less compelling. For $n = 500$, the empirical standard errors for estimates of $\tau^2/\sigma^2$
are roughly 25% larger than the theoretically predicted standard errors. For $n = 1000$, the
empirical and theoretical values are closer (they differ by approximately 10%); however, the
discrepancy is still substantially larger than that for estimates of $\sigma^2$ and $\tau^2$. Figures 1 and 2
contain histograms of the estimators for $\sigma^2$, $\tau^2$, and $\tau^2/\sigma^2$. Normal density plots with mean
1 (the actual value of $\sigma^2$, $\tau^2$, and $\tau^2/\sigma^2$ in this example) and variance $\psi_1^2$, $\psi_2^2$, and $\psi_0^2$ are
superimposed on the histograms. The histograms and normal densities seem to agree quite
well, as predicted by Corollaries 1-2 and Proposition 2.
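
The parenthesized standard errors in Table 1 can be reproduced with a short calculation. The sketch below assumes that, for $\Sigma = I$, the asymptotic variances take the form $\psi_1^2 = 2\{(d/n)(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\}$ and $\psi_2^2 = 2\{(1 + d/n)(\sigma^2+\tau^2)^2 - \sigma^4 + 3\tau^4\}$, and it obtains $\psi_0^2$ by the delta method, using $\hat\sigma^2(I) + \hat\tau^2(I) = n^{-1}\|y\|^2$ to recover the covariance. These expressions are stated as assumptions, since equations (15)-(16) and (19) are not reproduced in this section; the function name is hypothetical.

```python
import math


def predicted_standard_errors(n, d, sigma2=1.0, tau2=1.0):
    """Return (se_sigma2, se_tau2, se_ratio) under the assumed Sigma = I formulas."""
    rho = d / n
    total = sigma2 + tau2
    psi1_sq = 2.0 * (rho * total ** 2 + sigma2 ** 2 + tau2 ** 2)
    psi2_sq = 2.0 * ((1.0 + rho) * total ** 2 - sigma2 ** 2 + 3.0 * tau2 ** 2)
    # n * Var(||y||^2 / n) = 2 (sigma^2 + tau^2)^2 pins down the asymptotic covariance.
    cov = (2.0 * total ** 2 - psi1_sq - psi2_sq) / 2.0
    # Delta method for the ratio tau^2 / sigma^2.
    g_sigma = -tau2 / sigma2 ** 2
    g_tau = 1.0 / sigma2
    psi0_sq = (g_sigma ** 2 * psi1_sq + g_tau ** 2 * psi2_sq
               + 2.0 * g_sigma * g_tau * cov)
    return tuple(math.sqrt(v / n) for v in (psi1_sq, psi2_sq, psi0_sq))


se_500 = predicted_standard_errors(500, 1000)
se_1000 = predicted_standard_errors(1000, 1000)
print([round(v, 4) for v in se_500], [round(v, 4) for v in se_1000])
```

Under these assumptions the computed values agree with the parenthesized entries of Table 1 to four decimals.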



Fig 1. Example 1 ($d = 1000$). Histograms and normal density plots for the estimators $\hat\sigma^2(I)$, $\hat\tau^2(I)$, and
$\hat\tau^2(I)/\hat\sigma^2(I)$, with $x_i \sim N(0, I)$. Top row, $n = 500$; bottom row, $n = 1000$. Superimposed normal density plots
have mean 1 and variance $\psi_1^2$, $\psi_2^2$, and $\psi_0^2$ for $\hat\sigma^2(I)$, $\hat\tau^2(I)$, and $\hat\tau^2(I)/\hat\sigma^2(I)$, respectively. Corollaries 1-2
suggest that the distribution of the various estimators should be approximately equal to that of the corresponding
normal distribution.

For $x_i \sim N(0, \Sigma)$, with $\Sigma = (2d)^{-1}Z^TZ$, one might hope to use Proposition 2 (ii) to derive
theoretically predicted standard errors for the estimators $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$. However, in order
for Proposition 2 (ii) to apply, we must have $\sqrt{n}\,|\beta^T\Sigma^k\beta - \|\beta\|^2 d^{-1}\mathrm{tr}(\Sigma^k)| \to 0$, for $k = 1, 2$.
In this example, we had $\beta^T\Sigma\beta = 0.9831$ and $\beta^T\Sigma^2\beta = 1.4436$, while $\|\beta\|^2 d^{-1}\mathrm{tr}(\Sigma) = 1.0003$
and $\|\beta\|^2 d^{-1}\mathrm{tr}(\Sigma^2) = 1.5018$. Thus,

\[ \sqrt{500}\,\left|\beta^T\Sigma\beta - \frac{\|\beta\|^2}{d}\mathrm{tr}(\Sigma)\right| = 0.3839, \qquad \sqrt{1000}\,\left|\beta^T\Sigma\beta - \frac{\|\beta\|^2}{d}\mathrm{tr}(\Sigma)\right| = 0.5429, \]
\[ \sqrt{500}\,\left|\beta^T\Sigma^2\beta - \frac{\|\beta\|^2}{d}\mathrm{tr}(\Sigma^2)\right| = 1.3002, \qquad \sqrt{1000}\,\left|\beta^T\Sigma^2\beta - \frac{\|\beta\|^2}{d}\mathrm{tr}(\Sigma^2)\right| = 1.8387, \tag{27} \]

which suggests that the applicability of Proposition 2 (ii) may be questionable. Moreover,
asymptotically, if $\Sigma = (2d)^{-1}Z^TZ$ and $d \to \infty$, then it is known that the conditions of
Proposition 2 (ii) are not satisfied (see Remark 2, following Proposition 2). Nevertheless, we
believe it is informative to compare the empirical distribution of the estimators $\tilde\sigma^2$, $\tilde\tau^2$, and
$\tilde\tau^2/\tilde\sigma^2$ to normal distributions with mean 1 and variance $\tilde\psi_1^2$, $\tilde\psi_2^2$, and $\tilde\psi_0^2$, respectively, as
specified by Proposition 2 (ii); corresponding histograms and normal density plots may be
found in Figure 3. Upon visual inspection of Figure 3, the fit between the sampling distribution
of the estimators and the corresponding normal distribution appears to be reasonably good.

Fig 2. Example 1 ($d = 1000$). Histograms and normal density plots for the estimators $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$, with
$x_i \sim N(0, I)$. Top row, $n = 500$; bottom row, $n = 1000$. Superimposed normal density plots have mean 1 and
variance $\tilde\psi_1^2$, $\tilde\psi_2^2$, and $\tilde\psi_0^2$ for $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$, respectively. Proposition 2 (ii) suggests that the distribution
of the various estimators should be approximately equal to that of the corresponding normal distribution.

The results in Table 1 indicate that there is slightly more bias in the estimators when
$x_i \sim N(0, \Sigma)$ than when $x_i \sim N(0, I)$; this may be a result of the discrepancies (27).
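
The discrepancies in (27) are simple functions of the four reported quantities, and the arithmetic can be checked directly. The sketch below uses only the rounded values quoted in the text, so it matches (27) only up to the rounding of those four-decimal inputs; the dictionary and function names are hypothetical.

```python
import math

# Reported values for Example 1 with Sigma = Z^T Z / (2d).
beta_sigma_beta = {1: 0.9831, 2: 1.4436}    # beta^T Sigma^k beta
norm_times_trace = {1: 1.0003, 2: 1.5018}   # ||beta||^2 tr(Sigma^k) / d


def scaled_discrepancy(n, k):
    """sqrt(n) * |beta^T Sigma^k beta - ||beta||^2 tr(Sigma^k) / d|."""
    return math.sqrt(n) * abs(beta_sigma_beta[k] - norm_times_trace[k])


discrepancies = {(n, k): scaled_discrepancy(n, k)
                 for n in (500, 1000) for k in (1, 2)}
for key in sorted(discrepancies):
    print(key, round(discrepancies[key], 3))
```

None of the four values is close to 0, which is the sense in which the requirement of Proposition 2 (ii) looks doubtful at these sample sizes.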



Though this paper contains no theoretical results describing the behavior of our estimators
for non-normal data, the numerical results in this example suggest that some of the methods
proposed here may be successfully applied in broader circumstances. The results in Table 1
for $x_{ij} \in \{\pm 1\}$ binary show that all of the estimators considered in this example are nearly
unbiased and have standard errors that are similar to the corresponding standard errors in
the case where $x_i \sim N(0, I)$. Figure 4 contains histograms for the estimators $\tilde\sigma^2$, $\tilde\tau^2$, and
$\tilde\tau^2/\tilde\sigma^2$, with $x_{ij} \in \{\pm 1\}$ binary. Normal density plots with mean 1 and variance $\psi_1^2$, $\psi_2^2$, and
$\psi_0^2$ are superimposed on the histograms; these are the normal densities corresponding to the
asymptotic distribution of the estimators in the case where $x_i \sim N(0, I)$ (see Corollaries 1-2
and Proposition 2 (ii)). The histograms appear to match the densities quite well.

Fig 3. Example 1 ($d = 1000$). Histograms and normal density plots for the estimators $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$, with
$x_i \sim N(0, \Sigma)$ and $\Sigma = (2d)^{-1}Z^TZ$. Top row, $n = 500$; bottom row, $n = 1000$. Superimposed normal density
plots have mean 1 and variance $\tilde\psi_1^2$, $\tilde\psi_2^2$, and $\tilde\psi_0^2$ for $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$, respectively. For $n = 500$, $\tilde\psi_1 = 0.1835$,
$\tilde\psi_2 = 0.2211$, and $\tilde\psi_0 = 0.3841$; for $n = 1000$, $\tilde\psi_1 = 0.1054$, $\tilde\psi_2 = 0.1383$, and $\tilde\psi_0 = 0.2290$. See Table 1 for
empirical standard errors of estimators.



4.2. Example 2

When $d < n$, $\hat\sigma_0^2 = (n - d)^{-1}\|y - X\hat\beta_{ols}\|^2$ is a widely used estimator for $\sigma^2$. In Remark
2 following Theorem 2, we noted that the variance of $\hat\sigma_0^2$ does not depend on $\tau^2$, while the
variance of $\hat\sigma^2(I)$ and the other estimators for $\sigma^2$ proposed in this paper increases with $\tau^2$.
On the other hand, as $d/n \uparrow 1$, the variance of $\hat\sigma_0^2$ diverges, while that of $\hat\sigma^2(I)$ remains
bounded. In this brief example, we took $x_i \sim N(0, I)$, $\sigma^2 = \tau^2 = 1$, and $n = 500$, and
investigated the numerical performance of $\hat\sigma^2(I)$ and $\hat\sigma_0^2$ for various values of $d < n$. Five
hundred independent datasets were generated and the estimators were computed for each
dataset. Summary statistics are reported in Table 2.



Fig 4. Example 1 ($d = 1000$). Histograms and normal density plots for the estimators $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$, with
$x_{ij} \in \{\pm 1\}$ binary. Top row, $n = 500$; bottom row, $n = 1000$. Superimposed normal density plots have mean 1
and variance $\psi_1^2$, $\psi_2^2$, and $\psi_0^2$ for $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde\tau^2/\tilde\sigma^2$, respectively.

Table 2 indicates that in each setting, the estimators are nearly unbiased: the means of the
estimators are close to 1. The empirical standard errors of $\hat\sigma^2(I)$ and $\hat\sigma_0^2$ both increase with
$d$; however, the standard errors increase more rapidly for $\hat\sigma_0^2$. At $d = 250, 350$, the empirical
standard error of $\hat\sigma_0^2$ is smaller than that of $\hat\sigma^2(I)$; at $d = 450$, the trend reverses and the
empirical standard error of $\hat\sigma^2(I)$ is smaller than that of $\hat\sigma_0^2$. As $d$ becomes closer to $n = 500$,
the empirical standard error of $\hat\sigma^2(I)$ should remain bounded, while that of $\hat\sigma_0^2$ should diverge to
$\infty$. The results reported in this example suggest that even when $d < n$, there may be settings
where the estimators proposed in this paper may be preferred over other commonly used
estimators for $\sigma^2$; for instance, when $d < n$, but $d$ is very close to $n$.
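
The crossover described above can be anticipated from theory. With Gaussian errors and $d < n$, $(n - d)\hat\sigma_0^2/\sigma^2$ has a chi-squared distribution with $n - d$ degrees of freedom, so the standard error of $\hat\sigma_0^2$ is exactly $\sigma^2\{2/(n - d)\}^{1/2}$. For $\hat\sigma^2(I)$ the sketch below assumes the approximation $\psi_1/\sqrt{n}$ with $\psi_1^2 = 2\{(d/n)(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\}$, a form taken as an assumption here. The function names are hypothetical.

```python
import math


def se_ols_variance_estimator(n, d, sigma2=1.0):
    """Exact standard error of the OLS-based estimator under Gaussian errors."""
    return sigma2 * math.sqrt(2.0 / (n - d))


def se_identity_estimator(n, d, sigma2=1.0, tau2=1.0):
    """Approximate standard error psi_1 / sqrt(n) (assumed form of psi_1)."""
    psi1_sq = 2.0 * ((d / n) * (sigma2 + tau2) ** 2 + sigma2 ** 2 + tau2 ** 2)
    return math.sqrt(psi1_sq / n)


n = 500
comparison = {d: (se_ols_variance_estimator(n, d), se_identity_estimator(n, d))
              for d in (250, 350, 450)}
for d, (se_ols, se_identity) in comparison.items():
    print(d, round(se_ols, 4), round(se_identity, 4))
```

For $n = 500$ the two predicted standard errors cross between $d = 350$ and $d = 450$, in line with the empirical pattern in Table 2.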

4.3. Example 3

Sun and Zhang (2011) proposed methods for estimating $\sigma^2$ in high-dimensional linear models
that are very effective when $\beta$ is sparse. These methods use modified versions of lasso
(Tibshirani, 1996) and MC+ (Zhang, 2010) (referred to as "scaled lasso" and "scaled MC+,"
respectively) to simultaneously estimate $\sigma^2$ and $\beta$. Let $\hat\sigma^2_{lasso}$ and $\hat\sigma^2_{MC+}$ denote the scaled lasso
and scaled MC+ estimators for $\sigma^2$. In this example, we compared the performance of $\hat\sigma^2_{lasso}$
and $\hat\sigma^2_{MC+}$ with some of the estimators for $\sigma^2$ proposed in this paper, in settings where $\beta$ was
both sparse and non-sparse.

                   Estimator             d = 250    d = 350    d = 450
Mean               $\hat\sigma^2(I)$     0.9984     0.9986     0.9965
                   $\hat\sigma_0^2$      0.9979     1.0004     0.9902
Standard error     $\hat\sigma^2(I)$     0.1290     0.1389     0.1457
                   $\hat\sigma_0^2$      0.0901     0.1141     0.1947

Table 2

Example 2 ($n = 500$, $\sigma^2 = \tau^2 = 1$). Means and standard errors of estimators for $\sigma^2$, based on 500 independent
datasets.

With $d = 3000$, the predictors in this example were generated according to $x_i \sim N(0, \Sigma)$,
where $\Sigma = (\sigma_{ij})$ and $\sigma_{ij} = 0.5^{|i-j|}$. We fixed $\sigma^2 = 1$. Sparse and non-sparse (dense)
parameters $\beta \in \mathbb{R}^d$ were generated as follows. First, to generate the sparse $\beta$, five random
multiples of 25 between 25 and $d - 25 = 2975$ were selected. That is, we selected $k_1, \ldots, k_5$
from $\{25, 50, 75, \ldots, 2975\}$ independently and uniformly at random. Next, we took $\beta_0 \in \mathbb{R}^d$ to
be the vector with the 7-dimensional sub-vector $(1, 2, 3, 4, 3, 2, 1)^T$ centered at the coordinates
corresponding to $k_1, \ldots, k_5$ (so that the $k_j$-th entry of $\beta_0$ was 4, the $(k_j \pm 1)$-th was 3, etc.);
the remaining entries in $\beta_0$ were set equal to 0. We then set $\beta = \{3/(\beta_0^T\Sigma\beta_0)\}^{1/2}\beta_0$, so that
$\tau^2 = \beta^T\Sigma\beta = 3$. Note that this sparse $\beta$ was generated only once; in other words, the same
sparse $\beta$ was used throughout the simulations in this example. To generate the dense $\beta$ used
in this example, we followed the same procedure as for the sparse $\beta$, except that in $\beta_0$, the
7-dimensional subvector $(1, 2, 3, 4, 3, 2, 1)^T$ was centered at coordinates corresponding to each
multiple of 25 between 25 and 2975. Notice that for the sparse $\beta$, we had $\|\beta\|_0 = 7 \times 5 = 35$,
where $\|\beta\|_0$ denotes the number of non-zero coordinates in $\beta$, and for the dense $\beta$ we had
$\|\beta\|_0 = 7 \times (d/25 - 1) = 833$; however, $\tau^2 = \beta^T\Sigma\beta = 3$ was the same for both the sparse
and dense $\beta$. In this simulation study, we considered datasets with $n = 600$ and $n = 2400$
observations. With sparse $\beta$ and $n = 600$, the simulation settings in this example are very
similar to those in Example 1 from Section 4.1 of (Sun and Zhang, 2011).
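
The construction of the sparse and dense parameters can be sketched as follows. The function names are hypothetical, and the sketch draws the five centers without replacement so that $\|\beta\|_0 = 35$ holds exactly; that is an assumption, since the text describes the centers as drawn independently. The quadratic form $\beta^T\Sigma\beta$ is computed over the support of $\beta$ only, which avoids forming the $3000 \times 3000$ matrix $\Sigma$.

```python
import random

D = 3000
BUMP = (1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0)


def quad_form_ar1(beta, alpha=0.5):
    """beta^T Sigma beta for Sigma_ij = alpha^|i-j|, summed over the support of beta."""
    support = [(i, b) for i, b in enumerate(beta) if b != 0.0]
    return sum(bi * bj * alpha ** abs(i - j)
               for i, bi in support for j, bj in support)


def make_beta(centers, d=D, tau2=3.0):
    """Place the bump at each 1-based center, then rescale so beta^T Sigma beta = tau2."""
    beta0 = [0.0] * d
    for k in centers:
        for offset, value in zip(range(-3, 4), BUMP):
            beta0[k - 1 + offset] = value
    scale = (tau2 / quad_form_ar1(beta0)) ** 0.5
    return [scale * b for b in beta0]


multiples = list(range(25, D - 24, 25))  # 25, 50, ..., 2975
rng = random.Random(0)
sparse_beta = make_beta(rng.sample(multiples, 5))  # distinct centers (an assumption)
dense_beta = make_beta(multiples)

sparse_norm_sq = sum(b * b for b in sparse_beta)
dense_norm_sq = sum(b * b for b in dense_beta)
print(round(sparse_norm_sq, 4), round(dense_norm_sq, 4))
```

Both parameters have nearly the same squared norm because bumps centered at least 25 coordinates apart barely interact under this $\Sigma$.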

Under each of the settings described above, we generated 100 independent datasets and,
for each simulated dataset, we computed $\hat\sigma^2_{lasso}$, $\hat\sigma^2_{MC+}$, $\hat\sigma^2(\Sigma)$, $\hat\sigma^2(\hat\Sigma)$, and $\tilde\sigma^2$. For the scaled
lasso and MC+ estimators, we used the shrinkage parameter $\lambda_0 = \sqrt{\log(d)/n}$ (this value
of $\lambda_0$ yielded the best performance in the numerical examples in (Sun and Zhang, 2011)).
The scaled MC+ estimator requires specification of an additional parameter $\gamma$; following
(Sun and Zhang, 2011), we took $\gamma = 2/[1 - \max_{i \ne j}\{X_i^TX_j/(\|X_i\|\,\|X_j\|)\}]$, where $X_j$ denotes
the $j$-th column of $X$. The estimator $\hat\sigma^2(\hat\Sigma)$ was introduced in Section 3.1 of this paper. Here
we take advantage of the AR(1) structure of $\Sigma$ and set $\hat\Sigma = (\hat\sigma_{ij})$, where $\hat\sigma_{ij} = \hat\alpha^{|i-j|}$ and
$\hat\alpha$ is an estimate of the autoregressive parameter.
We view the estimator $\hat\sigma^2(\Sigma)$ as an "oracle estimator," which utilizes full knowledge of the actual
covariance matrix $\Sigma$; this estimator should perform similarly to the estimator $\hat\sigma^2(I)$ in settings
where $\mathrm{Cov}(x_i) = I$ and $\tau^2 = 3$ (see the discussion in Section 3.1). Finally, the estimator $\tilde\sigma^2$ is
the "unknown covariance" estimator from Section 3.2. Recall that our theoretical performance
guarantees for $\tilde\sigma^2$ (Proposition 2) require that $|\beta^T\Sigma^k\beta - \|\beta\|^2\mathrm{tr}(\Sigma^k)/d| \to 0$, for $k = 1, 2$. In
this example, for the sparse $\beta$ we had

\[ \frac{\|\beta\|^2}{d}\mathrm{tr}(\Sigma) - \beta^T\Sigma\beta = -1.7551 \quad\text{and}\quad \frac{\|\beta\|^2}{d}\mathrm{tr}(\Sigma^2) - \beta^T\Sigma^2\beta = -5.5409 \tag{28} \]

(the corresponding quantities are essentially the same for the dense $\beta$). Summary statistics
for the various estimators computed in this numerical study are reported in Table 3.
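
The two tuning parameters for the scaled lasso and MC+ estimators described above are simple functions of $n$, $d$, and the columns of $X$. The sketch below implements the formulas as they are written in this section; the function names and the tiny example matrix are hypothetical, and the code does not implement the scaled lasso or MC+ estimators themselves.

```python
import math


def scaled_lasso_lambda0(n, d):
    """Shrinkage parameter lambda_0 = sqrt(log(d) / n)."""
    return math.sqrt(math.log(d) / n)


def mcplus_gamma(X):
    """gamma = 2 / (1 - max over column pairs i != j of X_i^T X_j / (||X_i|| ||X_j||))."""
    n, d = len(X), len(X[0])
    cols = [[X[r][j] for r in range(n)] for j in range(d)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    max_cos = max(
        sum(a * b for a, b in zip(cols[i], cols[j])) / (norms[i] * norms[j])
        for i in range(d) for j in range(i + 1, d)
    )
    return 2.0 / (1.0 - max_cos)


lambda0_600 = scaled_lasso_lambda0(600, 3000)
lambda0_2400 = scaled_lasso_lambda0(2400, 3000)
gamma_orthogonal = mcplus_gamma([[1.0, 0.0], [0.0, 1.0]])  # orthogonal columns
print(round(lambda0_600, 4), round(lambda0_2400, 4), gamma_orthogonal)
```

With orthogonal columns the maximum correlation is 0 and $\gamma = 2$; strongly correlated columns push $\gamma$ upward.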



                                         Sparse $\beta$            Dense $\beta$
n        Estimator                       Mean      Std. Err.       Mean      Std. Err.
600      $\hat\sigma^2_{lasso}$          1.1117    0.0651          3.2600    0.2070
         $\hat\sigma^2_{MC+}$            1.0477    0.0633          3.1005    0.2107
         $\hat\sigma^2(\Sigma)$          0.9704    0.5049          0.9820    0.5641
         $\hat\sigma^2(\hat\Sigma)$      0.9693    0.5021          0.9835    0.5596
         $\tilde\sigma^2$               -0.6023    0.5182         -0.5747    0.5876
2400     $\hat\sigma^2_{lasso}$          1.0310    0.0295          2.3232    0.0706
         $\hat\sigma^2_{MC+}$            1.0060    0.0293          1.9997    0.0778
         $\hat\sigma^2(\Sigma)$          0.9808    0.1633          1.0095    0.1538
         $\hat\sigma^2(\hat\Sigma)$      0.9809    0.1631          1.0095    0.1537
         $\tilde\sigma^2$               -0.5827    0.2084         -0.5702    0.2228

Table 3

Example 3 ($d = 3000$, $\sigma^2 = 1$). Means and standard errors of estimators for $\sigma^2$, based on 100 independent
datasets. Left columns, sparse $\beta$; right columns, dense $\beta$.



For sparse $\beta$, the results in Table 3 indicate that $\hat\sigma^2_{lasso}$, $\hat\sigma^2_{MC+}$, $\hat\sigma^2(\Sigma)$, and $\hat\sigma^2(\hat\Sigma)$ are all
nearly unbiased (recall that $\sigma^2 = 1$ in this example). However, the empirical standard errors
for the scaled lasso and MC+ estimators are considerably smaller than the standard errors
for $\hat\sigma^2(\Sigma)$ and $\hat\sigma^2(\hat\Sigma)$. Note that in this example, the performance of $\hat\sigma^2(\hat\Sigma)$ is very similar to
that of the oracle estimator $\hat\sigma^2(\Sigma)$.

The estimator $\tilde\sigma^2$ is significantly biased in this example. Indeed, the mean value of $\tilde\sigma^2$
is negative, while $\sigma^2 > 0$. The poor performance of $\tilde\sigma^2$ in this example is not completely
unexpected, given that $|\beta^T\Sigma^k\beta - \|\beta\|^2\mathrm{tr}(\Sigma^k)/d|$ is substantially larger than 0 for $k = 1, 2$
(see (28)). In fact, more can be said. Using the approximation $\hat m_k \approx m_k = \mathrm{tr}(\Sigma^k)/d$, $k = 1, 2$,
one can check that

\[ E(\tilde\sigma^2) \approx \sigma^2 + \tau_1^2 - \frac{m_1}{m_2}\tau_2^2. \]

Thus, the bias of $\tilde\sigma^2$ is approximately $\tau_1^2 - (m_1/m_2)\tau_2^2$. In this example, $\tau_1^2 - (m_1/m_2)\tau_2^2 =
-1.5700$ and

\[ E(\tilde\sigma^2) \approx -0.5700 \]

(this calculation is for the sparse $\beta$; the result is almost exactly the same for the dense $\beta$).
Note the similarity between this approximation and the empirical means of $\tilde\sigma^2$ in Table 3.
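
The bias approximation can be checked from quantities reported in this example. The sketch below takes $\|\beta\|^2 = 1.2449$ from the caption of Table 4 and the second discrepancy from (28), and computes $m_2 = \mathrm{tr}(\Sigma^2)/d$ directly for the AR(1) matrix with parameter 0.5. It assumes the notation $\tau_k^2 = \beta^T\Sigma^k\beta$; the variable names are hypothetical.

```python
alpha, d = 0.5, 3000
sigma2 = 1.0
tau1_sq = 3.0           # beta^T Sigma beta, fixed by construction
beta_norm_sq = 1.2449   # ||beta||^2, as reported in the caption of Table 4
gap_k2 = -5.5409        # ||beta||^2 tr(Sigma^2)/d - beta^T Sigma^2 beta, from (28)

# m_k = tr(Sigma^k)/d for Sigma_ij = alpha^|i-j|; the diagonal of Sigma is 1,
# and tr(Sigma^2) is the sum of alpha^(2|i-j|) over all pairs (i, j).
m1 = 1.0
m2 = 1.0 + 2.0 * sum((1.0 - k / d) * alpha ** (2 * k) for k in range(1, d))

tau2_sq = beta_norm_sq * m2 - gap_k2       # beta^T Sigma^2 beta
bias = tau1_sq - (m1 / m2) * tau2_sq
approx_mean = sigma2 + bias
print(round(bias, 4), round(approx_mean, 4))
```

The computed bias and approximate mean agree with the values quoted above up to rounding of the reported inputs.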

For dense $\beta$, the performance of $\hat\sigma^2_{lasso}$ and $\hat\sigma^2_{MC+}$ breaks down, while the performance of
$\hat\sigma^2(\Sigma)$, $\hat\sigma^2(\hat\Sigma)$, and $\tilde\sigma^2$ remains virtually unchanged, as compared to the sparse $\beta$ case. When
$n = 600$, the empirical means of $\hat\sigma^2_{lasso}$ and $\hat\sigma^2_{MC+}$ are both greater than 3; when $n = 2400$,
the empirical means of $\hat\sigma^2_{lasso}$ and $\hat\sigma^2_{MC+}$ are both approximately 2 or greater. Both $\hat\sigma^2_{lasso}$ and $\hat\sigma^2_{MC+}$
depend on associated lasso and MC+ estimators for $\beta$. The performance break-down of $\hat\sigma^2_{lasso}$
and $\hat\sigma^2_{MC+}$ when $\beta$ is dense is likely related to the fact that the corresponding estimators for
$\beta$ perform poorly when $\beta$ is dense and $d/n$ is large. In Table 4, we report the empirical mean
squared error for the lasso and MC+ estimators for $\beta$ that are associated with $\hat\sigma^2_{lasso}$ and
$\hat\sigma^2_{MC+}$; note that mean squared error is substantially higher for estimating dense $\beta$.

            Sparse $\beta$           Dense $\beta$
n           lasso      MC+           lasso      MC+
600         0.1888     0.3696        1.2176     1.2457
2400        0.0514     0.0894        0.8961     0.9337

Table 4

Example 3 ($d = 3000$, $\sigma^2 = 1$, $\|\beta\|^2 = 1.2449$). Empirical mean squared error $\|\hat\beta - \beta\|^2$ of the scaled lasso
and MC+ estimators for $\beta$, based on 100 independent datasets.



Overall, the results of this simulation study suggest that estimators proposed in this paper
may be useful for estimating $\sigma^2$ in settings where $d/n$ is large and little is known about sparsity
in $\beta$. However, we emphasize two important points: (i) additional information about the
covariance matrix $\Sigma$ may be required to obtain consistent estimators for $\sigma^2$ (e.g. that $\Sigma$
has AR(1) structure) and (ii) the estimators for $\sigma^2$ proposed in this paper may have larger
standard error than estimators derived from a reliable estimate of $\beta$.

5. Discussion

In this paper, we proposed new estimators for $\sigma^2$, $\tau^2$, and the signal-to-noise ratio $\tau^2/\sigma^2$
in high-dimensional linear models. These estimators are based on linear combinations of
$T_1 = n^{-1}\|y\|^2$ and $T_2 = n^{-2}\|X^Ty\|^2$. Working under the assumption that $\mathrm{Cov}(x_i) = I$, the
key observation in deriving these estimators was that $ET_1$, $ET_2$ form a pair of non-degenerate
linear combinations involving $\sigma^2$ and $\tau^2$. In fact, as described in Section 2.1, unbiased estimators
for $\sigma^2$ and $\tau^2$ may be derived from any pair of statistics $T_1, T_2$ satisfying this property.
With $T_1 = n^{-1}\|y\|^2$ fixed, we presently discuss two alternatives for $T_2$, which may yield other
estimators for $\sigma^2$, $\tau^2$ in this manner. These examples are not meant to be exhaustive; rather,
they are illustrative of this technique's flexibility and raise some broader questions about
estimating $\sigma^2$ and $\tau^2$ in high-dimensional linear models.

First, let $U \in O(d)$ be a $d \times d$ Haar-distributed orthogonal matrix independent of $(y, X)$
and let $U_k$ denote the first $k$ columns of $U$, where $1 \le k \le \min\{d, n\}$. Then one may take
$T_2 = n^{-1}E(\|P_ky\|^2 \mid y, X)$, where $P_k = X_k(X_k^TX_k)^{-1}X_k^T$ and $X_k = XU_k$, so that $P_k$ is a
random rank-$k$ projection. As a second alternative to $T_2 = n^{-2}\|X^Ty\|^2$, one could take $T_2 =
n^{-1}\|X\hat\beta_{ridge}\|^2$, where $\hat\beta_{ridge}$ is some ridge regression estimator for $\beta$ (Hoerl and Kennard,
1970). One aspect of these alternatives' potential appeal is that they might yield consistent
estimators for $\sigma^2$ and $\tau^2$ with smaller variance than the estimators studied in this paper. However,
a theoretical analysis of these estimators' properties may be somewhat involved. Indeed,
for $T_2 = n^{-1}E(\|P_ky\|^2 \mid y, X)$, it is easy to calculate $ET_2$ and find the corresponding unbiased
estimators for $\sigma^2$ and $\tau^2$ using symmetry arguments (provided $\mathrm{Cov}(x_i) = I$), but computing
the variance of these estimators appears to be fairly challenging. If $T_2 = n^{-1}\|X\hat\beta_{ridge}\|^2$,
then closed-form expressions for $ET_2$ and, consequently, for the associated unbiased estimators
of $\sigma^2$, $\tau^2$ are generally not available; however, results from random matrix theory
suggest that simplified asymptotic analyses may be possible. Note that in order to implement
either of these alternatives to $T_2 = n^{-2}\|X^Ty\|^2$, specification of an additional tuning
parameter is required: for $T_2 = n^{-1}E(\|P_ky\|^2 \mid y, X)$, the rank parameter $k$ must be specified;
for $T_2 = n^{-1}\|X\hat\beta_{ridge}\|^2$, the ridge shrinkage parameter (typically, a nonnegative constant
denoted by $\lambda$) must be specified.

A number of questions are raised by the examples discussed in the previous paragraph. For
instance, it is clear that estimators for $\sigma^2$, $\tau^2$ derived using different statistics $T_1, T_2$ may (or
may not!) be more efficient than the estimators $\hat\sigma^2$, $\hat\tau^2$ studied here; however, an exhaustive
study of all pairs $T_1, T_2$ aimed at identifying the optimal estimators for $\sigma^2$, $\tau^2$ is likely impossible.
This suggests the need for a more unified approach to studying efficiency and optimality
for estimating $\sigma^2$ and $\tau^2$ in high-dimensional linear models, which, given the ambiguity of
likelihood-based approaches noted in Section 1.3, may be challenging. Additionally, while we
have shown that the proposed approach to estimating $\sigma^2$ and $\tau^2$ based on linear combinations
of statistics $T_1, T_2$ is effective when $\mathrm{Cov}(x_i) = I$ and that this approach may be successfully
modified when $\Sigma$ satisfies additional conditions, it is unclear whether a similar approach may
be applied effectively when $\Sigma$ is unknown and arbitrary. Studying different statistics $T_1, T_2$
may provide additional insight into this problem, but other methodologies may be required
to handle more general $\Sigma$.

Appendix

Proof of Theorem 2

Theorem 2 is an immediate consequence of the following lemma and its corollary.

Lemma A1. Suppose that $\Sigma = I$. Then

\[ \mathrm{Var}\left(\frac{1}{n}\|y\|^2\right) = \frac{2}{n}(\sigma^2 + \tau^2)^2, \tag{29} \]

8rf 15 15



- < -a- 

n n 



+ — + 4+ — + , 
n n"^ n n'^ 



\n n J \n n 



(31) 



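Proof. Equation (29) is obvious because $\|y\|^2 \sim (\sigma^2 + \tau^2)\chi_n^2$. To prove (30), we condition on
$X$ and use properties of expectations involving quadratic forms and normal random vectors
to obtain

\begin{align*}
\mathrm{Var}(\|X^Ty\|^2) &= E\{\mathrm{Var}(\|X^Ty\|^2 \mid X)\} + \mathrm{Var}\{E(\|X^Ty\|^2 \mid X)\} \\
&= 2\sigma^4 E\,\mathrm{tr}\{(X^TX)^2\} + 4\sigma^2 E\{\beta^T(X^TX)^3\beta\} \\
&\quad + \mathrm{Var}\{\sigma^2\mathrm{tr}(X^TX) + \beta^T(X^TX)^2\beta\} \\
&= 2\sigma^4 E\,\mathrm{tr}\{(X^TX)^2\} + 4\sigma^2 E\{\beta^T(X^TX)^3\beta\} + \sigma^4 E\{\mathrm{tr}(X^TX)\}^2 \\
&\quad + 2\sigma^2 E\{\mathrm{tr}(X^TX)\beta^T(X^TX)^2\beta\} + E\{\beta^T(X^TX)^2\beta\}^2 - \sigma^4\{E\,\mathrm{tr}(X^TX)\}^2 \\
&\quad - 2\sigma^2 E\,\mathrm{tr}(X^TX)\,E\{\beta^T(X^TX)^2\beta\} - [E\{\beta^T(X^TX)^2\beta\}]^2.
\end{align*}

Given this expression for $\mathrm{Var}(\|X^Ty\|^2)$, (30) follows from Proposition S1 in the Supplemental
Text. Equation (31) is proved similarly: we have

\begin{align*}
\mathrm{Cov}(\|y\|^2, \|X^Ty\|^2) &= E\{\mathrm{Cov}(\|y\|^2, \|X^Ty\|^2 \mid X)\} + \mathrm{Cov}\{E(\|y\|^2 \mid X), E(\|X^Ty\|^2 \mid X)\} \\
&= 2\sigma^4 E\,\mathrm{tr}(X^TX) + 4\sigma^2 E\{\beta^T(X^TX)^2\beta\} \\
&\quad + \mathrm{Cov}\{\beta^TX^TX\beta,\ \sigma^2\mathrm{tr}(X^TX) + \beta^T(X^TX)^2\beta\} \\
&= 2\sigma^4 E\,\mathrm{tr}(X^TX) + 4\sigma^2 E\{\beta^T(X^TX)^2\beta\} + \sigma^2 E\{\mathrm{tr}(X^TX)\beta^TX^TX\beta\} \\
&\quad + E\{\beta^TX^TX\beta\,\beta^T(X^TX)^2\beta\} - \sigma^2 E(\beta^TX^TX\beta)\,E\,\mathrm{tr}(X^TX) \\
&\quad - E(\beta^TX^TX\beta)\,E\{\beta^T(X^TX)^2\beta\}
\end{align*}

and (31) follows from Proposition S1 in the Supplemental Text. □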
Corollary A1. Under the conditions of Lemma A1,



Var a' = — <^ - + l + ^ + - + ^ a^+ — + ^ + - + — 
Proof. Corollary A1 follows from Lemma A1 and the fact that

-^V^^ -^^\( n~-\\y\\- 



□ 



Proof of Theorem 3

Theorem 3 is a direct application of Theorem 2.2 from (Chatterjee, 2009), which is stated
here for ease of reference.

Theorem A1. [Theorem 2.2, (Chatterjee, 2009)] Let $v = (v_1, \ldots, v_m)^T \sim N(0, I)$. Suppose
that $g \in C^2(\mathbb{R}^m)$ and let $\nabla g$ and $\nabla^2 g$ denote the gradient and the Hessian of $g$, respectively.
Let

\[ \kappa_1 = \{E\|\nabla g(v)\|^4\}^{1/4}, \qquad \kappa_2 = \{E\|\nabla^2 g(v)\|^4\}^{1/4}, \]

where $\|\nabla^2 g(v)\|$ is the operator norm of $\nabla^2 g(v)$. Suppose that $Eg(v)^4 < \infty$ and let $\psi^2 =
\mathrm{Var}\{g(v)\}$. Let $w$ be a normal random variable having the same mean and variance as $g(v)$.
Then

\[ d_{TV}\{g(v), w\} \le \frac{2\sqrt{5}\,\kappa_1\kappa_2}{\psi^2}. \tag{32} \]

Remark 1. Chatterjee's Theorem 2.2 does not actually require Gaussian $v$. However, for
non-Gaussian $v$, an additional term appears in the bound (32), which is not sufficiently small for
our purposes. Furthermore, the class of distributions covered by the full version of Chatterjee's
Theorem 2.2 is not all-encompassing: $v_i$ must be a $C^2$-function of a normal random variable.

To prove Theorem 3, we apply Theorem A1 with $v = (X, \epsilon) \in \mathbb{R}^{(d+1)n}$. Let $h \in C^2(\mathbb{R}^2)$ and
let

$$g(X, \epsilon) = h(T),$$

where $T = T(X, \epsilon) = (n^{-1}\|y\|^2,\ n^{-2}\|X^Ty\|^2)^T$. First, we bound the quantities $\kappa_1$, $\kappa_2$ in
Theorem A1. In order to bound $\kappa_1$, we compute the gradient of $g$. Let $h_1$, $h_2$ denote the
partial derivatives of $h$ with respect to the first and second variables, respectively. Then



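$$\frac{\partial g}{\partial x_{ij}}(X, \epsilon) = h_1(T)\,\frac{\partial}{\partial x_{ij}}\frac{1}{n}\|y\|^2 + h_2(T)\,\frac{\partial}{\partial x_{ij}}\frac{1}{n^2}\|X^Ty\|^2$$

for $i = 1, \ldots, n$, $j = 1, \ldots, d$. Let $E_{ij}$ denote the $n \times d$ matrix with $i'j'$-entry $\delta_{ii'}\delta_{jj'}$ ($\delta_{ii'} = 1$ if
$i = i'$ and 0 otherwise). Since

$$\frac{\partial}{\partial x_{ij}}\|y\|^2 = 2\beta^TE_{ij}^Ty$$

and

$$\frac{\partial}{\partial x_{ij}}\|X^Ty\|^2 = 2y^TE_{ij}X^Ty + 2\beta^TE_{ij}^TXX^Ty,$$

it follows that

$$\frac{\partial g}{\partial x_{ij}}(X, \epsilon) = 2h_1(T)\left(\frac{1}{n}\beta^TE_{ij}^Ty\right) + 2h_2(T)\left(\frac{1}{n^2}y^TE_{ij}X^Ty + \frac{1}{n^2}\beta^TE_{ij}^TXX^Ty\right). \qquad (33)$$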
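For $1 \le k \le n$, the partial derivative of $g$ with respect to $\epsilon_k$ is given by

$$
\begin{aligned}
\frac{\partial g}{\partial \epsilon_k}(X, \epsilon)
&= h_1(T)\,\frac{\partial}{\partial \epsilon_k}\frac{1}{n}\|y\|^2 + h_2(T)\,\frac{\partial}{\partial \epsilon_k}\frac{1}{n^2}\|X^Ty\|^2 \\
&= 2h_1(T)\left(\frac{1}{n}e_k^Ty\right) + 2h_2(T)\left(\frac{1}{n^2}e_k^TXX^Ty\right), \qquad (34)
\end{aligned}
$$

where $e_k \in \mathbb{R}^n$ is the $k$-th standard basis vector in $\mathbb{R}^n$ (i.e. the $k'$-th entry of $e_k$ is $\delta_{kk'}$) and
we have used the facts

$$\frac{\partial}{\partial \epsilon_k}\|y\|^2 = 2e_k^Ty, \qquad \frac{\partial}{\partial \epsilon_k}\|X^Ty\|^2 = 2e_k^TXX^Ty.$$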
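The gradient formulas (33)-(34) can be checked numerically. The sketch below is not part of the original proof: it assumes $T = (n^{-1}\|y\|^2, n^{-2}\|X^Ty\|^2)^T$ with $y = X\beta + \epsilon$, uses a hypothetical smooth test function $h(t_1, t_2) = t_1 t_2 + \sin t_1$, and compares the closed-form partial derivatives with central finite differences. The function names `t_stat`, `grad_g` and `numeric_grad` are illustrative.

```python
import numpy as np

def t_stat(X, eps, beta):
    """T(X, eps) = (||y||^2 / n, ||X^T y||^2 / n^2) with y = X beta + eps."""
    n = X.shape[0]
    y = X @ beta + eps
    return np.array([y @ y / n, np.sum((X.T @ y) ** 2) / n**2])

def grad_g(X, eps, beta, h_grad):
    """Closed-form gradient of g = h(T), following (33) and (34)."""
    n = X.shape[0]
    y = X @ beta + eps
    Xty = X.T @ y      # X^T y, length d
    XXty = X @ Xty     # X X^T y, length n
    h1, h2 = h_grad(t_stat(X, eps, beta))
    # (33): beta^T E_ij^T y = y_i beta_j, y^T E_ij X^T y = y_i (X^T y)_j,
    #       beta^T E_ij^T X X^T y = (X X^T y)_i beta_j
    dX = (2 * h1 / n) * np.outer(y, beta) + (2 * h2 / n**2) * (
        np.outer(y, Xty) + np.outer(XXty, beta)
    )
    # (34): e_k^T y = y_k and e_k^T X X^T y = (X X^T y)_k
    deps = (2 * h1 / n) * y + (2 * h2 / n**2) * XXty
    return dX, deps

def numeric_grad(f, z, step=1e-6):
    """Central finite differences of a scalar function f at the array z."""
    out = np.zeros_like(z)
    for idx in np.ndindex(z.shape):
        zp, zm = z.copy(), z.copy()
        zp[idx] += step
        zm[idx] -= step
        out[idx] = (f(zp) - f(zm)) / (2 * step)
    return out

# Hypothetical test function h and its gradient (an assumption for illustration).
def h(t):
    return t[0] * t[1] + np.sin(t[0])

def h_grad(t):
    return np.array([t[1] + np.cos(t[0]), t[0]])

rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.standard_normal((n, d))
eps = rng.standard_normal(n)
beta = rng.standard_normal(d)

dX, deps = grad_g(X, eps, beta, h_grad)
dX_num = numeric_grad(lambda Z: h(t_stat(Z, eps, beta)), X)
deps_num = numeric_grad(lambda e: h(t_stat(X, e, beta)), eps)
max_err = max(np.abs(dX - dX_num).max(), np.abs(deps - deps_num).max())
```

Agreement between the two gradients supports the $1/n$ and $1/n^2$ scalings used in $T$; it does not check the subsequent norm bounds.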

Now recall that $\kappa_1 = (E\|\nabla g(X, \epsilon)\|^4)^{1/4}$. Equations (33)-(34) and the elementary inequality

$$(a + b)^2 \le 2a^2 + 2b^2, \qquad a, b \in \mathbb{R}, \qquad (35)$$

imply that

i=i j=i ^ / 

k=l ' k = l ^ 

= ^(r^ + l)/^i(T)^||y|p + ^/^2(T)^ {||y|n|X^y|r + (r^ + HXX^yl^ 

Let $\lambda_1 = \|n^{-1}X^TX\|$ be the largest eigenvalue of $n^{-1}X^TX$. Applying the triangle inequality
and (35) yields

\\VgiX,eW < ^(r^ + l)/.i(T)^(||X^X||r2 + ||e|n 



+l^/i2(T)2||X^X||(||X^X||V + ||e|r) 



16 

< — 

n 



+g/.2(T)^||X^X|p (r^ + (||X^X||r^ + ||e|p) 
||VMT)|p|8Ai (^ll^lpy + + 1) l||e||V + (lOA? + Ai) 
+ (A? + l)-||e|p + (A? + Ai)r2
n 



34 



< ^||VMT)||2(Ai + lf 



1 



n 



n 



e||^ + l +r^(r^ + l) . 



Thus, 



= {E\\VgiX,e)\\^) 
E 



1/4 



< 



264 

n 



||V/i(T)||^(Ai + If <^ -||e||^ _||e||^ + 1 + t^t' + 1) 



n 



n 



1/4 



o 



(36) 



where 



lk = E 



||VMT)|r(Ai + l)M-||6 



n 



To bound $\kappa_2 = \{E\|\nabla^2 g(X, \epsilon)\|^4\}^{1/4}$, we bound the operator norm of the Hessian, $\|\nabla^2 g(X, \epsilon)\|$.



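Let

$$\mathcal{U} = \Big\{\mathbf{U} = (u\ U);\ u = (u_1, \ldots, u_n)^T \in \mathbb{R}^n,\ U = (u_{ij})_{1 \le i \le n,\ 1 \le j \le d},\ \sum_{k=1}^n u_k^2 + \sum_{i=1}^n\sum_{j=1}^d u_{ij}^2 = 1\Big\}$$

be the collection of partitioned $n \times (d + 1)$ matrices with Frobenius norm equal to one. For
$\mathbf{U} = (u\ U) \in \mathcal{U}$, define the differential operator

$$D_{\mathbf{U}} = \sum_{i=1}^n\sum_{j=1}^d u_{ij}\frac{\partial}{\partial x_{ij}} + \sum_{k=1}^n u_k\frac{\partial}{\partial \epsilon_k}.$$

Then

$$
\begin{aligned}
\|\nabla^2 g(X, \epsilon)\|
&= \sup_{\mathbf{U} \in \mathcal{U}} \big|D_{\mathbf{U}}^2 g(X, \epsilon)\big| \\
&= \sup_{\mathbf{U} \in \mathcal{U}} \big|\nabla h(T)^T D_{\mathbf{U}}^2 T(X, \epsilon) + \{D_{\mathbf{U}} T(X, \epsilon)\}^T \nabla^2 h(T)\, D_{\mathbf{U}} T(X, \epsilon)\big| \\
&\le \sup_{\mathbf{U} \in \mathcal{U}} \big\{\|\nabla h(T)\|\,\|D_{\mathbf{U}}^2 T(X, \epsilon)\| + \|\nabla^2 h(T)\|\,\|D_{\mathbf{U}} T(X, \epsilon)\|^2\big\}. \qquad (37)
\end{aligned}
$$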
From our previous calculations,

$$D_{\mathbf{U}} T(X, \epsilon) = \begin{pmatrix} \dfrac{2}{n}\big(\beta^TU^Ty + u^Ty\big) \\[2ex] \dfrac{2}{n^2}\big(y^TUX^Ty + \beta^TU^TXX^Ty + u^TXX^Ty\big) \end{pmatrix}.$$



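To compute $D_{\mathbf{U}}^2 T(X, \epsilon)$, we need the second order partial derivatives of $\|y\|^2$ and $\|X^Ty\|^2$;
these are given below:

$$
\begin{aligned}
\frac{\partial^2}{\partial x_{i'j'}\partial x_{ij}}\|y\|^2 &= 2\beta^TE_{ij}^TE_{i'j'}\beta, \\
\frac{\partial^2}{\partial \epsilon_k\partial x_{ij}}\|y\|^2 &= 2\beta^TE_{ij}^Te_k, \\
\frac{\partial^2}{\partial \epsilon_{k'}\partial \epsilon_k}\|y\|^2 &= 2e_k^Te_{k'}
\end{aligned}
$$

and

$$
\begin{aligned}
\frac{\partial^2}{\partial x_{i'j'}\partial x_{ij}}\|X^Ty\|^2 &= 2\beta^TE_{i'j'}^TE_{ij}X^Ty + 2\beta^TE_{ij}^TE_{i'j'}X^Ty + 2y^TE_{ij}E_{i'j'}^Ty \\
&\quad + 2y^TE_{ij}X^TE_{i'j'}\beta + 2\beta^TE_{ij}^TXE_{i'j'}^Ty + 2\beta^TE_{ij}^TXX^TE_{i'j'}\beta, \\
\frac{\partial^2}{\partial \epsilon_k\partial x_{ij}}\|X^Ty\|^2 &= 2e_k^TE_{ij}X^Ty + 2y^TE_{ij}X^Te_k + 2\beta^TE_{ij}^TXX^Te_k, \\
\frac{\partial^2}{\partial \epsilon_{k'}\partial \epsilon_k}\|X^Ty\|^2 &= 2e_k^TXX^Te_{k'}
\end{aligned}
$$

for $1 \le i, i', k, k' \le n$ and $1 \le j, j' \le d$. It follows that the entries of $D_{\mathbf{U}}^2 T(X, \epsilon)$ are

$$\frac{1}{n}D_{\mathbf{U}}^2\|y\|^2 = \frac{2}{n}\beta^TU^TU\beta + \frac{4}{n}\beta^TU^Tu + \frac{2}{n}\|u\|^2$$

and

$$
\begin{aligned}
\frac{1}{n^2}D_{\mathbf{U}}^2\|X^Ty\|^2 &= \frac{2}{n^2}y^TUU^Ty + \frac{4}{n^2}\beta^TU^TUX^Ty + \frac{4}{n^2}\beta^TU^TXU^Ty + \frac{2}{n^2}\beta^TU^TXX^TU\beta \\
&\quad + \frac{4}{n^2}u^TUX^Ty + \frac{4}{n^2}y^TUX^Tu + \frac{4}{n^2}\beta^TU^TXX^Tu + \frac{2}{n^2}u^TXX^Tu.
\end{aligned}
$$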
n n n n n 
We conclude that

$$\|D_{\mathbf{U}} T(X, \epsilon)\|^2 = \frac{4}{n^2}\big(\beta^TU^Ty + u^Ty\big)^2 + \frac{4}{n^4}\big(y^TUX^Ty + \beta^TU^TXX^Ty + u^TXX^Ty\big)^2$$



12, 



< -,{r' + my\\' + -A\X'X\\{\\y\\'+\\X'X\y+\\X'X\\)\\y 



< l^(r2 + l)fAir2 + i||e|r 
n \ n 

+— Ai (a^t^It^ + 1) + i||e|p (\^ + -\\e\\' 
n \ n \ n 



O 



^ ^(A? + Ai)r2(r^ + l) + -||6|pfAi + i||e"2 



and 



\DlT{X,e)\\ < ^(r + l)^ + ^{||y|p + 4||X||(r + l)||y|| + ||X^X||(r+l)^} 

n \ n 



Combining (37)- (39), we obtain 



K,= {E\\W'g{X,e)\\'y/' = 



I {rir + + ^Tr\r' + 1) + iT + iT^r^ + 1) 



where 



Vk = E 



||V^MT)|r(Ai + l) 



12 



Appealing to Theorem A1, the bounds (36) and (40) imply



drv {giX, e),w} = 



j^3/2^2 



where 
and 



e = r^ r, rf, n) = 7]/^ + 72^/^ + 7o^/V(r + 1) 



^2 2 7 N 1/4 , 1/4 , 1/4 2/ 2 , IN I 1/4 , 1/4/ 2 , i\ 

This completes the proof of Theorem 3. 
Proof outline for Proposition 2

Let

$$\hat\sigma^2(m) = \left\{1 + \frac{d\,m_1^2}{(n + 1)m_2}\right\}\frac{1}{n}\|y\|^2 - \frac{m_1}{n(n + 1)m_2}\|X^Ty\|^2,$$

$$\hat\tau^2(m) = -\frac{d\,m_1^2}{n(n + 1)m_2}\|y\|^2 + \frac{m_1}{n(n + 1)m_2}\|X^Ty\|^2,$$

where $m = (m_1, m_2)^T$. With $m = (m_1, m_2)^T = (d^{-1}\mathrm{tr}(\Sigma),\ d^{-1}\mathrm{tr}(\Sigma^2))^T$, consider the
estimators $\hat\sigma^2(m)$ and $\hat\tau^2(m)$. Under the conditions of Proposition 2, Proposition S1 from the
Supplemental Text implies that E{mk — rrtkY = 0{n~'^), k = 1,2; furthermore, existing re- 
sults on the eigenvalues of Wishart matrices imply that Ein^'"'^^^''^ = 0(1) for r > sufficiently 
small (see, for example, the Appendix of (Dicker, 2012a); this is where the conditions that 
\n — d\ > 9 and d/n is bounded away from 1 are required). These facts can be combined to 
obtain 

E {a'(m) - ^^(m)}' = O (^-^^ and E {f\ih) - f\in)Y = O (^-^^ . (41) 

Additionally, it can be shown that 

E{d\in)] = a' + 0(A2) and E{f^{in)] = + 0(A2) (42) 

and 



Var{a^(m)} = ^ + O ) and Var{f^(m)} = ^ + O f | , (43) 



where Proposition S1 in the Supplemental Text and the variance/covariance decompositions
in the proof of Lemma A1 are useful for proving (43). Part (i) of Proposition 2 (consistency)
follows from (41)-(43). Part (ii) of Proposition 2 (asymptotic normality) also follows from
(41)-(43), upon noticing that Theorem 3 may be applied to $\hat\sigma^2(m)$ and $\hat\tau^2(m)$, as in Corollary
1. Asymptotic normality for $\hat\tau^2/\hat\sigma^2$ follows from the delta method.
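As an illustration of the oracle estimators $\hat\sigma^2(m)$ and $\hat\tau^2(m)$ that open this proof outline, the sketch below evaluates them on simulated data. It is a minimal example under stated assumptions, not the paper's code: the formulas are read as $\hat\sigma^2(m) = \{1 + d m_1^2/((n+1)m_2)\}\,n^{-1}\|y\|^2 - m_1\|X^Ty\|^2/\{n(n+1)m_2\}$ and $\hat\tau^2(m) = \{m_1\|X^Ty\|^2 - d m_1^2\|y\|^2\}/\{n(n+1)m_2\}$, the predictor covariance is taken to be the identity (so $m_1 = m_2 = 1$), and the function name `oracle_estimators` and the simulation settings are hypothetical.

```python
import numpy as np

def oracle_estimators(y, X, m1, m2):
    """Return (sigma2_hat(m), tau2_hat(m)) for known moments m1, m2 of Sigma."""
    n, d = X.shape
    y_sq = y @ y                      # ||y||^2
    xty_sq = np.sum((X.T @ y) ** 2)   # ||X^T y||^2
    c = n * (n + 1) * m2
    sigma2 = (1 + d * m1**2 / ((n + 1) * m2)) * y_sq / n - m1 * xty_sq / c
    tau2 = -d * m1**2 * y_sq / c + m1 * xty_sq / c
    return sigma2, tau2

# Simulated example (assumed settings): Sigma = I, sigma^2 = 1, tau^2 = ||beta||^2 = 1.
rng = np.random.default_rng(0)
n, d = 2000, 1000
beta = np.ones(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ beta + rng.standard_normal(n)

sigma2_hat, tau2_hat = oracle_estimators(y, X, m1=1.0, m2=1.0)
```

By construction the two estimates sum to $n^{-1}\|y\|^2$, so they split the observed outcome variance into a noise part and a signal part. In practice $m_1$ and $m_2$ would be replaced by estimates, which is the step controlled by (41).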
Supplemental text: Moment calculations for the Wishart distribution

Suppose that $X = (x_1, \ldots, x_n)^T$ is an $n \times d$ matrix with iid rows $x_1, \ldots, x_n \sim N(0, \Sigma)$ and that
$\Sigma$ is a $d \times d$ positive definite matrix. Then $W = X^TX$ is a Wishart$(n, \Sigma)$ random matrix.
Let $\beta \in \mathbb{R}^d$. In this Supplemental Text we provide formulas for various moments involving
$W$ that are used in the paper. Letac and Massam (2004) and Graczyk et al. (2005) provide
techniques for computing all such moments. These techniques are utilized here.

The symmetric group and a formula for a class of moments involving W 

Let $S_k$ denote the symmetric group on $k$ elements. Then each permutation $\pi \in S_k$ can be
written uniquely as a product of disjoint cycles $\pi = C_1 \cdots C_{m(\pi)}$, where $C_j = (c_{1j} \cdots c_{k_jj})$,
$k_1 + \cdots + k_{m(\pi)} = k$, and all of the $c_{ij} \in \{1, \ldots, k\}$ are distinct.

Let $H_1, \ldots, H_k$ be $d \times d$ symmetric matrices and define the polynomial

$$r_\pi(\Sigma)(H_1, \ldots, H_k) = \prod_{j=1}^{m(\pi)} \mathrm{tr}\Big(\prod_{i=1}^{k_j} \Sigma H_{c_{ij}}\Big).$$

Theorem 1 in Letac and Massam (2004) and Proposition 1 in Graczyk et al. (2005) give the
following formula:

$$E\{\mathrm{tr}(WH_1) \cdots \mathrm{tr}(WH_k)\} = \sum_{\pi \in S_k} 2^{k - m(\pi)}\, n^{m(\pi)}\, r_\pi(\Sigma)(H_1, \ldots, H_k). \qquad (44)$$

This is our main tool for deriving the explicit formulas in the next section. 
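Formula (44) can be evaluated mechanically by enumerating permutations and their cycles. The sketch below is illustrative rather than the authors' implementation: it assumes that $r_\pi(\Sigma)(H_1, \ldots, H_k)$ is the product over the cycles of $\pi$ of $\mathrm{tr}(\prod \Sigma H_c)$, taken in cycle order, and the helper names `cycles_of` and `wishart_trace_moment` are hypothetical. It then recovers $E\,\beta^TW^2\beta$ from the decomposition $\beta^TW^2\beta = \|\beta\|^2\sum_j \mathrm{tr}(WH_j)^2$ with $H_j = (u_1u_j^T + u_ju_1^T)/2$ and $u_1 = \beta/\|\beta\|$.

```python
import itertools
import numpy as np

def cycles_of(perm):
    """Disjoint cycles of a permutation given as a tuple mapping i -> perm[i]."""
    seen, cycles = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, j = [], start
        while j not in seen:
            seen.add(j)
            cycle.append(j)
            j = perm[j]
        cycles.append(cycle)
    return cycles

def wishart_trace_moment(n, Sigma, Hs):
    """E{tr(W H_1) ... tr(W H_k)} for W ~ Wishart(n, Sigma), using formula (44)."""
    k = len(Hs)
    d = Sigma.shape[0]
    total = 0.0
    for perm in itertools.permutations(range(k)):
        cyc = cycles_of(perm)
        r = 1.0
        for c in cyc:
            M = np.eye(d)
            for idx in c:                 # product of Sigma H_c along the cycle
                M = M @ Sigma @ Hs[idx]
            r *= np.trace(M)
        total += 2 ** (k - len(cyc)) * n ** len(cyc) * r
    return total

# Example: E(beta^T W^2 beta) via the orthonormal-basis decomposition.
Sigma = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]])
beta = np.array([1.0, -2.0, 0.5])
n, d = 7, 3
# QR gives an orthonormal basis whose first vector is +/- beta / ||beta||.
U, _ = np.linalg.qr(np.column_stack([beta, np.eye(d)[:, :2]]))
tau0 = beta @ beta
lhs = 0.0
for j in range(d):
    Hj = (np.outer(U[:, 0], U[:, j]) + np.outer(U[:, j], U[:, 0])) / 2
    lhs += tau0 * wishart_trace_moment(n, Sigma, [Hj, Hj])
# Closed form: d n m_1 tau_1 + n (n + 1) tau_2 = n tr(Sigma) tau_1 + n (n + 1) tau_2.
rhs = n * np.trace(Sigma) * (beta @ Sigma @ beta) + n * (n + 1) * (beta @ Sigma @ Sigma @ beta)
```

The enumeration costs $k!$ trace evaluations, which is fine for the $k \le 4$ cases needed in this text but would not scale to high-order moments.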

Explicit moment formulas used in the paper 

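For non-negative integers $k$, define $\tau_k = \beta^T\Sigma^k\beta$ and $m_k = d^{-1}\mathrm{tr}(\Sigma^k)$.
Proposition S1. We have

$$
\begin{aligned}
E\,\mathrm{tr}(W) &= dnm_1 && (45) \\
E\,\mathrm{tr}(W)^2 &= d^2n^2m_1^2 + 2dnm_2 && (46) \\
E\,\mathrm{tr}(W^2) &= d^2nm_1^2 + dn(n + 1)m_2 && (47) \\
E\,\beta^TW\beta &= n\tau_1 && (48) \\
E\,\beta^TW^2\beta &= dnm_1\tau_1 + n(n + 1)\tau_2 && (49) \\
E\{\mathrm{tr}(W)\beta^TW\beta\} &= dn^2m_1\tau_1 + 2n\tau_2 && (50) \\
E\{\mathrm{tr}(W)\beta^TW^2\beta\} &= d^2n^2m_1^2\tau_1 + dn(n^2 + n + 2)m_1\tau_2 \\
&\quad + 2dnm_2\tau_1 + 4n(n + 1)\tau_3 && (51) \\
E(\beta^TW\beta\,\beta^TW^2\beta) &= dn(n + 2)m_1\tau_1^2 + n(n + 2)(n + 3)\tau_1\tau_2 && (52) \\
E\,\beta^TW^3\beta &= d^2nm_1^2\tau_1 + 2dn(n + 1)m_1\tau_2 \\
&\quad + dn(n + 1)m_2\tau_1 + n(n^2 + 3n + 4)\tau_3 && (53) \\
E(\beta^TW^2\beta)^2 &= d^2n(n + 2)m_1^2\tau_1^2 + 2dn(n + 2)(n + 3)m_1\tau_1\tau_2 \\
&\quad + 2dn(n + 2)m_2\tau_1^2 + 4n(n + 2)(n + 3)\tau_1\tau_3 \\
&\quad + n(n + 1)(n + 2)(n + 3)\tau_2^2 && (54)
\end{aligned}
$$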
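The closed forms (45)-(54) are easy to mistype, so a small sanity check may be useful. The sketch below is an illustration, not the authors' code: the function name `prop_s1_moments` is hypothetical, and the formulas are transcribed from Proposition S1 as reconstructed above. In the scalar case $d = 1$, $\Sigma = 1$, $\beta = 1$, the matrix $W$ is a $\chi^2_n$ random variable, so every formula must reduce to one of the moments $E W^r = n(n+2)\cdots(n+2r-2)$.

```python
import numpy as np

def prop_s1_moments(n, Sigma, beta):
    """Closed-form moments (45)-(54) of W ~ Wishart(n, Sigma), keyed by equation number."""
    d = Sigma.shape[0]
    power = np.linalg.matrix_power
    m1 = np.trace(Sigma) / d               # m_k = tr(Sigma^k) / d
    m2 = np.trace(power(Sigma, 2)) / d
    t1 = float(beta @ Sigma @ beta)         # tau_k = beta^T Sigma^k beta
    t2 = float(beta @ power(Sigma, 2) @ beta)
    t3 = float(beta @ power(Sigma, 3) @ beta)
    return {
        45: d * n * m1,
        46: d**2 * n**2 * m1**2 + 2 * d * n * m2,
        47: d**2 * n * m1**2 + d * n * (n + 1) * m2,
        48: n * t1,
        49: d * n * m1 * t1 + n * (n + 1) * t2,
        50: d * n**2 * m1 * t1 + 2 * n * t2,
        51: (d**2 * n**2 * m1**2 * t1 + d * n * (n**2 + n + 2) * m1 * t2
             + 2 * d * n * m2 * t1 + 4 * n * (n + 1) * t3),
        52: d * n * (n + 2) * m1 * t1**2 + n * (n + 2) * (n + 3) * t1 * t2,
        53: (d**2 * n * m1**2 * t1 + 2 * d * n * (n + 1) * m1 * t2
             + d * n * (n + 1) * m2 * t1 + n * (n**2 + 3 * n + 4) * t3),
        54: (d**2 * n * (n + 2) * m1**2 * t1**2
             + 2 * d * n * (n + 2) * (n + 3) * m1 * t1 * t2
             + 2 * d * n * (n + 2) * m2 * t1**2
             + 4 * n * (n + 2) * (n + 3) * t1 * t3
             + n * (n + 1) * (n + 2) * (n + 3) * t2**2),
    }

# Scalar check: with d = 1, Sigma = 1, beta = 1 and n = 5, W ~ chi-square(5).
scalar = prop_s1_moments(5, np.array([[1.0]]), np.array([1.0]))
```

For $n = 5$ the chi-square moments are $E W = 5$, $E W^2 = 35$, $E W^3 = 315$ and $E W^4 = 3465$; each entry of `scalar` should equal the moment whose order matches the total degree in $W$ of the corresponding formula.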



Proof. Formulas (45) and (48) are trivial (notice that $\beta^TW\beta \sim \tau_1\chi^2_n$). Formulas (46)-(47)
may be found in (Letac and Massam, 2004).

Now let $u_1, \ldots, u_d \in \mathbb{R}^d$ be an orthonormal basis of $\mathbb{R}^d$, with $\beta = \|\beta\|u_1$. Define the $d \times d$
symmetric matrices $H_{ij} = (u_iu_j^T + u_ju_i^T)/2$ and $H_j = H_{1j}$, $i, j = 1, \ldots, d$. Then

$$\beta^TW^2\beta = \tau_0\sum_{j=1}^d \mathrm{tr}(WH_j)^2. \qquad (55)$$
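Since $S_2 = \{(1\ 2), (1)(2)\}$, the formula (44) and Lemma S1 below imply

$$E\{\mathrm{tr}(WH_j)^2\} = 2n\,\mathrm{tr}(\Sigma H_j\Sigma H_j) + n^2\mathrm{tr}(\Sigma H_j)^2 = n(n + 1)(u_1^T\Sigma u_j)^2 + n\,u_1^T\Sigma u_1\,u_j^T\Sigma u_j.$$

To prove (49), observe that

$$
\begin{aligned}
E\,\beta^TW^2\beta &= \tau_0\sum_{j=1}^d E\{\mathrm{tr}(WH_j)^2\} \\
&= n(n + 1)\sum_{j=1}^d \tau_0(u_1^T\Sigma u_j)^2 + n\sum_{j=1}^d \tau_0\,u_1^T\Sigma u_1\,u_j^T\Sigma u_j \\
&= n(n + 1)\tau_2 + dnm_1\tau_1.
\end{aligned}
$$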

For (50), equation (44) implies

$$
\begin{aligned}
E\{\mathrm{tr}(W)\beta^TW\beta\} &= \tau_0 E\{\mathrm{tr}(W)\mathrm{tr}(WH_1)\} \\
&= 2n\tau_2 + dn^2m_1\tau_1.
\end{aligned}
$$

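To prove (51), first notice that

$$E\{\mathrm{tr}(W)\beta^TW^2\beta\} = \tau_0\sum_{j=1}^d E\{\mathrm{tr}(W)\mathrm{tr}(WH_j)^2\}$$

and that (44) implies

$$E\{\mathrm{tr}(W)\mathrm{tr}(WH_j)^2\} = \sum_{\pi \in S_3} 2^{3 - m(\pi)}\, n^{m(\pi)}\, r_\pi(\Sigma)(I, H_j, H_j).$$

It is clear that

$$
\begin{aligned}
r_{(1\ 2\ 3)}(\Sigma)(I, H_j, H_j) &= r_{(1\ 3\ 2)}(\Sigma)(I, H_j, H_j), \\
r_{(1\ 2)(3)}(\Sigma)(I, H_j, H_j) &= r_{(1\ 3)(2)}(\Sigma)(I, H_j, H_j).
\end{aligned}
$$

Thus, by Lemma S1,

$$
\begin{aligned}
E\{\mathrm{tr}(W)\mathrm{tr}(WH_j)^2\} &= 8n\,r_{(1\ 2\ 3)}(\Sigma)(I, H_j, H_j) + 4n^2 r_{(1\ 2)(3)}(\Sigma)(I, H_j, H_j) \\
&\quad + 2n^2 r_{(1)(2\ 3)}(\Sigma)(I, H_j, H_j) + n^3 r_{(1)(2)(3)}(\Sigma)(I, H_j, H_j) \\
&= 8n\,\mathrm{tr}(\Sigma^2H_j\Sigma H_j) + 4n^2\mathrm{tr}(\Sigma^2H_j)\mathrm{tr}(\Sigma H_j) \\
&\quad + 2n^2\mathrm{tr}(\Sigma)\mathrm{tr}(\Sigma H_j\Sigma H_j) + n^3\mathrm{tr}(\Sigma)\mathrm{tr}(\Sigma H_j)^2.
\end{aligned}
$$

Combining this with (56) yields

$$
\begin{aligned}
E\{\mathrm{tr}(W)\beta^TW^2\beta\} &= 2n\tau_0\sum_{j=1}^d u_1^T\Sigma^2u_1\,u_j^T\Sigma u_j + 2n\tau_0\sum_{j=1}^d u_1^T\Sigma u_1\,u_j^T\Sigma^2u_j \\
&\quad + 4n(n + 1)\tau_0\sum_{j=1}^d u_1^T\Sigma^2u_j\,u_1^T\Sigma u_j + n^2\mathrm{tr}(\Sigma)\tau_0\sum_{j=1}^d u_1^T\Sigma u_1\,u_j^T\Sigma u_j \\
&\quad + n^2(n + 1)\mathrm{tr}(\Sigma)\tau_0\sum_{j=1}^d (u_1^T\Sigma u_j)^2 \\
&= dn(n^2 + n + 2)m_1\tau_2 + 2dnm_2\tau_1 + 4n(n + 1)\tau_3 + d^2n^2m_1^2\tau_1.
\end{aligned}
$$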

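The proof of (52) is similar to the proof of (51). By (44) and Lemma S1,

$$
\begin{aligned}
E\{\mathrm{tr}(WH_1)\mathrm{tr}(WH_j)^2\} &= 8n\,r_{(1\ 2\ 3)}(\Sigma)(H_1, H_j, H_j) + 4n^2 r_{(1\ 2)(3)}(\Sigma)(H_1, H_j, H_j) \\
&\quad + 2n^2 r_{(1)(2\ 3)}(\Sigma)(H_1, H_j, H_j) + n^3 r_{(1)(2)(3)}(\Sigma)(H_1, H_j, H_j) \\
&= 8n\,\mathrm{tr}(\Sigma H_1\Sigma H_j\Sigma H_j) + 4n^2\mathrm{tr}(\Sigma H_1\Sigma H_j)\mathrm{tr}(\Sigma H_j) \\
&\quad + 2n^2\mathrm{tr}(\Sigma H_1)\mathrm{tr}(\Sigma H_j\Sigma H_j) + n^3\mathrm{tr}(\Sigma H_1)\mathrm{tr}(\Sigma H_j)^2 \\
&= 2n\{(u_1^T\Sigma u_1)^2u_j^T\Sigma u_j + 3u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2\} + 4n^2u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2 \\
&\quad + n^2\{(u_1^T\Sigma u_1)^2u_j^T\Sigma u_j + u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2\} + n^3u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2 \\
&= n(n + 2)(u_1^T\Sigma u_1)^2u_j^T\Sigma u_j + n(n^2 + 5n + 6)u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2.
\end{aligned}
$$

It follows that

$$
\begin{aligned}
E(\beta^TW\beta\,\beta^TW^2\beta) &= \tau_0^2\sum_{j=1}^d E\{\mathrm{tr}(WH_1)\mathrm{tr}(WH_j)^2\} \\
&= n(n + 2)\sum_{j=1}^d \tau_0^2(u_1^T\Sigma u_1)^2u_j^T\Sigma u_j + n(n^2 + 5n + 6)\sum_{j=1}^d \tau_0^2\,u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2 \\
&= dn(n + 2)m_1\tau_1^2 + n(n^2 + 5n + 6)\tau_1\tau_2.
\end{aligned}
$$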
To prove (53), consider the decomposition

$$\beta^TW^3\beta = \tau_0\sum_{i,j=1}^d \mathrm{tr}(WH_i)\mathrm{tr}(WH_j)\mathrm{tr}(WH_{ij}).$$

Equation (44) implies that



Since 



i,j=l i,j=l 
d d 

i,j=l i,j=l 



it follows that 



Ef3^W'f3 = Snr^ ^ r(i 2 3){S){H„ Hj, Hi,) + 4n\^ ^ r(i)(2 3){S){H„ H„ H,,) 

i,j=l i,j=l 
d d 

i,j=l i,j=l 

By Lemma 1, 

d d 
i,j=l i,j=l 

-\-4:U^ UuiU^ UujuJ Euj } 

= - [d^mlrl + dm2Tl + 2(imir| + Arl) 
d
I])



T, 



.2 <i 



To ^(^ 2){3){S){Hi, Hj, H, 



T, 



d 

Y tT{SHiSHj)ti{EHij) 





9 d 



«J=1 



^0 '"(l)(2)(3)(^)(^^,^i,^i: 



Tg X^ uf Z'Ujuf i7UjU^Z'Uj 



^J=l 



^3- 



Using these results with (57) we obtain 



$$E\,\beta^TW^3\beta = d^2nm_1^2\tau_1 + 2dn(n + 1)m_1\tau_2 + dn(n + 1)m_2\tau_1 + (n^3 + 3n^2 + 4n)\tau_3.$$
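Finally, we prove (54). Similar to the proof of (52)-(53), we have the decomposition

$$(\beta^TW^2\beta)^2 = \tau_0^2\sum_{i,j=1}^d \mathrm{tr}(WH_i)^2\,\mathrm{tr}(WH_j)^2.$$

By (44),

$$E\{\mathrm{tr}(WH_i)^2\,\mathrm{tr}(WH_j)^2\} = \sum_{\pi \in S_4} 2^{4 - m(\pi)}\, n^{m(\pi)}\, r_\pi(\Sigma)(H_i, H_i, H_j, H_j).$$

It follows that

$$E(\beta^TW^2\beta)^2 = \sum_{\pi \in S_4} 2^{4 - m(\pi)}\, n^{m(\pi)}\, \tilde r_\pi,$$

where

$$\tilde r_\pi = \tau_0^2\sum_{i,j=1}^d r_\pi(\Sigma)(H_i, H_i, H_j, H_j).$$

One can easily see that

$$
\begin{aligned}
\tilde r_{(1\ 2\ 3\ 4)} &= \tilde r_{(1\ 2\ 4\ 3)} = \tilde r_{(1\ 3\ 4\ 2)} = \tilde r_{(1\ 4\ 3\ 2)}, \\
\tilde r_{(1\ 3\ 2\ 4)} &= \tilde r_{(1\ 4\ 2\ 3)}, \\
\tilde r_{(1)(2\ 3\ 4)} &= \tilde r_{(1)(2\ 4\ 3)} = \tilde r_{(1\ 3\ 4)(2)} = \tilde r_{(1\ 4\ 3)(2)} \\
&= \tilde r_{(1\ 2\ 3)(4)} = \tilde r_{(1\ 3\ 2)(4)} = \tilde r_{(1\ 2\ 4)(3)} = \tilde r_{(1\ 4\ 2)(3)}, \\
\tilde r_{(1\ 3)(2\ 4)} &= \tilde r_{(1\ 4)(2\ 3)}, \\
\tilde r_{(1\ 2)(3)(4)} &= \tilde r_{(1)(2)(3\ 4)}, \\
\tilde r_{(1\ 3)(2)(4)} &= \tilde r_{(1\ 4)(2)(3)} = \tilde r_{(1)(3)(2\ 4)} = \tilde r_{(1)(4)(2\ 3)}.
\end{aligned}
$$

Thus,

$$
\begin{aligned}
E(\beta^TW^2\beta)^2 &= 32n\,\tilde r_{(1\ 2\ 3\ 4)} + 16n\,\tilde r_{(1\ 3\ 2\ 4)} + 32n^2\tilde r_{(1)(2\ 3\ 4)} + 8n^2\tilde r_{(1\ 3)(2\ 4)} \\
&\quad + 4n^2\tilde r_{(1\ 2)(3\ 4)} + 4n^3\tilde r_{(1\ 2)(3)(4)} + 8n^3\tilde r_{(1\ 3)(2)(4)} + n^4\tilde r_{(1)(2)(3)(4)}.
\end{aligned}
$$

It only remains to evaluate the $\tilde r_\pi$. It follows from Lemma S1 that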



^(1 2 3 4) = ^ TQti{EHiEHiEHjEHj 



d 4 



+ (u^Z'ui)^u^Z'UjUjZ'Uj + (u^Z'ui)^(uJ'Z'Uj)^} 
16 

r(1 3 2 4) = J] ToV^^.^^,^^.^^,) 
= E|{(nrruO^(urru,)^

+6u^Z'uiU^Z'UjU^i7ujuJ"Z'Uj + (u^Z'ui)^(uJ'Z'Uj)^ j 
= \ + Qrlrl + dm2T^) 

d 

r{i)(2 3 4) = roV(ri7,)tr(rf/,rf/,ri7,) 

d 4 

+ 2 u 1 u^J" u j u j i7 u j I 

1 

4' 



d 



r(i3)(2 4) = ^T^ti{SHiEHjf 



«J=1 



— jU]^ ZujU;^ EvLj + ZuiUj Zujj 



r{i2)(3 4) = ^ roV(Z/f,Zif,)tr(Zi7,Z/f,; 



d 4 

^ ^ {(ufZui)' + u^Zuiuf Zu,} {(uf Zu,)2 + ZuiuJZu,} 
^(r| + 2dmirfr| + dX^f) 



r(i2)(3)(4) = ro^tr(Z^r,Zi7,)tr(Z/7,)^ 



I {(urZu,)^ + uf Zu.uf Zu.} «Zu ^2 
d



r{l 3)(2)(4) 



^ T^ti{EHiEHj)ti{EHi)ti{EHj) 




d 



r(l)(2)(3)(4) 



d 




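Combining this with (58), we conclude that

$$
\begin{aligned}
E(\beta^TW^2\beta)^2 &= 32n\,\tilde r_{(1\ 2\ 3\ 4)} + 16n\,\tilde r_{(1\ 3\ 2\ 4)} + 32n^2\tilde r_{(1)(2\ 3\ 4)} + 8n^2\tilde r_{(1\ 3)(2\ 4)} + 4n^2\tilde r_{(1\ 2)(3\ 4)} \\
&\quad + 4n^3\tilde r_{(1\ 2)(3)(4)} + 8n^3\tilde r_{(1\ 3)(2)(4)} + n^4\tilde r_{(1)(2)(3)(4)} \\
&= 2n(2\tau_2^2 + 6dm_1\tau_1\tau_2 + 6\tau_1\tau_3 + d^2m_1^2\tau_1^2 + dm_2\tau_1^2) + 2n(\tau_2^2 + 6\tau_1\tau_3 + dm_2\tau_1^2) \\
&\quad + 8n^2(\tau_2^2 + dm_1\tau_1\tau_2 + 2\tau_1\tau_3) + 2n^2(\tau_2^2 + dm_2\tau_1^2 + 2\tau_1\tau_3) \\
&\quad + n^2(\tau_2^2 + 2dm_1\tau_1\tau_2 + d^2m_1^2\tau_1^2) + 2n^3(\tau_2^2 + dm_1\tau_1\tau_2) + 4n^3(\tau_2^2 + \tau_1\tau_3) + n^4\tau_2^2 \\
&= (n^4 + 6n^3 + 11n^2 + 6n)\tau_2^2 + d(2n^3 + 10n^2 + 12n)m_1\tau_1\tau_2 \\
&\quad + (4n^3 + 20n^2 + 24n)\tau_1\tau_3 + d^2(n^2 + 2n)m_1^2\tau_1^2 + d(2n^2 + 4n)m_2\tau_1^2 \\
&= d^2n(n + 2)m_1^2\tau_1^2 + 2dn(n + 2)(n + 3)m_1\tau_1\tau_2 + 2dn(n + 2)m_2\tau_1^2 \\
&\quad + 4n(n + 2)(n + 3)\tau_1\tau_3 + n(n + 1)(n + 2)(n + 3)\tau_2^2.
\end{aligned}
$$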



□ 



Lemma S1.

Let $u_1, \ldots, u_d \in \mathbb{R}^d$ and define $H_{ij} = (u_iu_j^T + u_ju_i^T)/2$ and $H_j = H_{1j}$. For integers $1 \le i, j \le d$,



(59) 
(60) 
$$\mathrm{tr}(\Sigma H_i\Sigma H_{ij}) = \frac{1}{2}\big(u_1^T\Sigma u_i\,u_i^T\Sigma u_j + u_1^T\Sigma u_j\,u_i^T\Sigma u_i\big) \qquad (61)$$

+uf r^Uiuf + ruiuf r^uj} (62) 

4 



+2uiruiUi rujufruj} (63) 

tT{IJ HiS Hj S Hij) = - \^uj SuiuJ SuiuJ Suj + uf Z'ui(uf Z'uj)^ 

+ (u^['Z'Uj)^uJZ'Uj + (u^Z'Uj)^uJ'i7uj 

+4u^IJuiU^IJujuJlJuj} (64) 
+Qu^ Suiu^ SuiU^ SujuJ Suj + 3uf Z'ui(uf Z'uj)^uf i7uj 

+(uf ruO'uf ruiu.ru,- + (uf rui)2(uf ru,)2} (65) 
ti{s H.SHjSHiSHj) = - {(ufru,)2(ufruj)2 + 6u^i;uiufruiufi;ujufi;uj 

8 

+(u]^rui)2(ufruj)2} (66) 

Proof. The identity (59) is trivial. To prove (60), we have 
tT(EHiSHj) = ^tr{r(uiuf + u,uf)r(uiuJ + Ujuf)} 

= -tr (Z'uiUj SuiUj + Z'uiUj SujU^ + SuiU^ SuiUj + SuiU^ SujU^ j 

Equation (61) follows from 
tT{UHiSH,j) = ^tr{i;(uiuf + Uiuf)r(uiuJ + u,uf)} 

= -tr (Z'uiuf Z'ujuJ + EuiuJ EujuJ + i7ujuf Z'ujuJ + Z'ujU^Z'Ujuf ) 
For (62), we have 

tiiS^HiSHj) = ^tr{r2(uiuf + Uiuf)r(uiuJ + Ujuf)} 
= -tr (Z'^Uiuf Z'uiuJ + i7^Uiuf i7ujuf + Z'^Ujuf i7uiuj + i7^Uju!f Z'UjU^)
= - (uf Z'UjU^Z'^Uj + Z'^Uiuf Z'uj + i7uiuf Z'^Uj + Z'^UjUiZ'Uj) . 
To prove (63)- (64), observe that 

ti^SHiSHjUHj) = -tr {Z'(uiuf + Ujuf )Z'(uiU"^ + u,u^)Z'(uiuJ' + Ujuf )| 

8 7 J 

= -tr (Z'uiuf Z'uiuJZ'uiuJ + Z'uiuf Z'uiuJZ'Ujuf 

-|-i7uiuf Z'Ujuf i7uiuj + Z'uiuf Z'Ujuf Z'Ujuf 
-l-Z'Ujuf Z'uiuJZ'uiuJ + Z'Ujuf Z'uiuJZ'ujuf 



+Z'UjU^Z'UjU^Z'uiuJ + EuiU^ SujU^ EnjU^^ 
- |u^Z'Uj(u-f i7uj)^ + u^Z'uiU^Z'UjuJZ'Uj + 2u^Z'uiU^['Z'UjuJ'Z'Uj| 



and 



ii{EHiEHjEHij) = -^tr {i7(uiuf + Ujuf )r(uiuj + u^uf )r(uiuj + u^uf)} 

= -tr (^EuiuJ EuiuJ EuiuJ + Z'uiuf Z'uiuJZ'ujuf 

+i7uiuf Z'UjU^Z'UjuJ + EuiuJ EujuJ EujuJ 
+Z'ujuf Z'uiuJZ'ujuJ + i7ujuf Z'uiuJZ'ujuf 
+Z'UjU^Z'UjU^Z'UjuJ + Z'Uju]"i7ujU^Z'UjuJ") 

= - \^u^ EuiuJ EuiuJ Euj + u^Z'ui(uJ"Z'Uj)^ + (u]"Z'Uj)^uJZ'Uj 
Finally, to prove (65)-(66), we have 



tr{EHiEHiEHjEHj) 
1 

~ 16 
1 

~ 16 



— tr |Z'(uiuf + UjU^)Z'(uiuJ' + Ujuf )i7(uiuj + UjU"[')i7(uiuJ + u^u^)} 
tr (i7uiuf i7uiuf Z'uiuJZ'uiuJ' + Z'uiuf i7uiuf Z'uiuJZ'ujuf 



+Z'uiUj i7uiUj i7ujU^ Z'uiUj + Euiu^ i^UiUj EujU^ EujU^ 
+i7uiuf Z'ujuf Z'ujuf Z'uiuJ + Z'uiuf Z'ujU^Z'ujU^Z'uju]"
H-i^UjU'f Z'uiuf Z'uiuJZ'uiuJ + Z'ujuf Z'uiuf Z'uiuJZ'ujuf 
+Z'ujuf Z'uiuf i7ujuf Z'uiuJ + Z'ujuf Z'uiuf Z'ujuf Z'ujU^ 
+Z'ujU^Z'UjU^Z'uiuJZ'uiuJ + Z'UjU^Z'UjU^Z'uiuJZ'UjU^ 
+X'UjU^Z'UjU^i7ujU^Z'uiuJ + i7ujuf Z'ujuf i7ujU^i7ujuf ) 

= — |2(u^Z'Uj)^(u^Z'Uj)^ + 3u^Z'ui(u^Z'Uj)^uJZ'Uj 

+6uf i7uiu^i7ujuf Z'ujuf Z'uj + Su'f Z'ui(uf Z'uj)^uJ'Z'uj 

and 

= ^tr {Z'(uiuf + Ujuf )Z'(uiuJ + Ujuf )Z'(uiuf + Ujuf )r(uiuj + u^-uf)} 

= Yg^^ (i^Uiuf Z'uiuJZ'uiuf Z'uiuJ + i7uiuf Z'uiuJZ'uiuf Z'ujU^ 

+Z'uiuJ'i7uiuJZ'ujU^Z'uiuJ + Z'uiuJ'Z'uiuJZ'ujU^Z'ujU^ 
H-Z'uiuf i7Ujuf Z'uiuf Z'uiuJ + Z'uiuf i7ujuf Z'uiuJ'Z'UjU^ 
+Z'uiUj i7ujU^ SuiVL]^ EuixVj + Z'uiUj Z'ujUj^ SujU-^ 
+i7ujU^Z'uiuJZ'uiuJ"Z'uiuJ + Z'UjU^J'Z'uiuJZ'uiU^Z'UjU^ 
+Z'ujuf Z'uiuJZ'ujuf Z'uiuJ + Z'ujU^Z'uiuJZ'ujU^Z'ujU^ 
+Z'u.juf Z'ujuf Z'uiuf Z'uiuJ + Z'u,jU^i7uju]"Z'uiuf Z'ujuf 
+Z'UjU^Z'UjU^Z'UjU^Z'uiuJ + Z'UjU^Z'UjU^Z'u.jU^Z'UjU^) 

8 

□ 



